ABSTRACT

Pilot-Job systems play an important role in supporting
distributed scientific computing. They are used to consume more
than 700 million CPU hours a year by the Open Science Grid
communities, and to process up to 1 million jobs a day for
the ATLAS experiment on the Worldwide LHC Computing
Grid. With the increasing importance of task-level parallelism
in high-performance computing, Pilot-Job systems
are also being adopted beyond traditional domains.
Notwithstanding the growing impact on scientific research,
there is no agreement on a definition of Pilot-Job system
and no clear understanding of the underlying abstraction
and paradigm. Pilot-Job implementations have proliferated
with no shared best practices or open interfaces and little
interoperability. Ultimately, this is hindering the realization
of the full impact of Pilot-Jobs by limiting their robustness,
portability, and maintainability. This paper offers a
comprehensive analysis of Pilot-Job systems, critically assessing
their motivations, evolution, properties, and implementation.
The three main contributions of this paper are: (i) an analysis
of the motivations and evolution of Pilot-Job systems;
(ii) an outline of the Pilot abstraction, its distinguishing
logical components and functionalities, its terminology, and its
architecture pattern; and (iii) the description of core and
auxiliary properties of Pilot-Job systems and the analysis of
seven exemplar Pilot-Job implementations. Together, these
contributions illustrate the Pilot paradigm, its generality,
and how it helps to address some challenges in distributed
scientific computing.

Pilot-Jobs provide a multi-stage mechanism to execute
workloads. Resources are acquired via a placeholder job and
subsequently assigned to workloads. Pilot-Jobs are having
a high impact on scientific and distributed computing [1].
They are used to consume more than 700 million CPU hours
a year [2] by the Open Science Grid (OSG) [3,4] communities,
and process up to 1 million jobs a day [5] for the ATLAS
experiment on the Large Hadron Collider (LHC) [7] Computing
Grid (WLCG) [8,9]. A variety of Pilot-Job systems
are used on distributed computing infrastructures (DCI):
Glidein/GlideinWMS [10,11], the Coaster System [12],
DIANE [13], DIRAC [14], PanDA [15], GWPilot [16],
Nimrod/G [17], Falkon [18], and MyCluster [19], to name a few.

A reason for the success and proliferation of Pilot-Job
systems is that they provide a simple solution to the
rigid resource management model historically found in high-
performance and distributed computing. Pilot-Jobs break
free of this model in two ways: (i) by using late binding
to make the selection of resources easier and more
effective [20-22]; and (ii) by decoupling the workload
specification from the management of its execution. Late binding
results in the ability to utilize resources dynamically, i.e.,
the workload is distributed onto resources only when they
are effectively available. Decoupling workload specification
and execution simplifies the scheduling of workloads on those
resources.

In spite of the success and impact of Pilot-Jobs, we perceive 
a problem: the development of Pilot-Job systems has not 
been grounded on an analytical understanding of underpinning
abstractions, architectural patterns, or computational
paradigms. The properties and functionalities of Pilot-Jobs 
have been understood mostly, if not exclusively, in relation 
to the needs of the containing software systems or on use 
cases justifying their immediate development. 

These limitations have also resulted in a fragmented software
landscape, where many Pilot-Job systems lack generality,
interoperability, and robust implementations. This has
led to a proliferation of functionally equivalent systems
motivated by similar objectives that often serve particular use
cases and target particular resources.

Addressing the limitations of Pilot systems while improving
our general understanding of Pilot-Job systems is a priority
due to the role they will play in the next generation of high-
performance computing. Most existing high-performance
system software and middleware are designed to support
the execution and optimization of single tasks. Based on
their current utilization, Pilot-Jobs have the potential to
support the growing need for scalable task-level parallelism and
dynamic resource management in high-performance computing.

The causes of the current status quo of Pilot-Job systems 
are social, economic, and technical. While social and economic
considerations may play a determining role in promoting
fragmented solutions, this paper focuses on the technical
aspects of Pilot-Jobs. We contribute a critical analysis of
the current state of the art describing the technical motivations
and evolution of Pilot-Job systems, their characterizing
abstraction (the Pilot abstraction), and the properties of 
their most representative and prominent implementations. 
Our analysis will yield the Pilot paradigm, i.e., the way in 
which Pilot-Jobs are used to support and perform distributed 
computing. 

The remainder of this paper is divided into four sections. §2
offers a description of the technical motivations of Pilot-Job
systems and of their evolution.

In §3, the logical components and functionalities constituting
the Pilot abstraction are discussed. We offer a terminology
consistent across Pilot-Job implementations, and an
architecture pattern for Pilot-Job systems is derived and
described.

In §4, the focus moves to Pilot-Job implementations and
to their core and auxiliary properties. These properties
are described and then used alongside the Pilot abstraction
and the pilot architecture pattern to describe and compare
exemplar Pilot-Job implementations.

In §5, we outline the Pilot paradigm, arguing for its
generality, and elaborating on how it impacts and relates to
both other middleware and applications. Insight is offered
about the future directions and challenges faced by the Pilot
paradigm and its Pilot-Job systems.

2. EVOLUTION OF PILOT-JOB SYSTEMS 

Three aspects of Pilot-Jobs are investigated in this paper:
the Pilot-Job system, the Pilot-Job abstraction, and
the Pilot-Job paradigm. A Pilot-Job system is a type of software,
the Pilot-Job abstraction is the set of properties of that
type of software, and the Pilot-Job paradigm is the way in
which Pilot-Job systems enable the execution of workloads
on resources. For example, DIANE is an implementation of
a Pilot-Job system; its components and functionalities are
elements of the Pilot-Job abstraction; and the type of workloads,
the type of resources, and the way in which DIANE
executes the former on the latter are features of the Pilot-Job
paradigm.

This section introduces Pilot-Job systems by investigating 
their technical origins and motivations alongside the chronology
of their development.

2.1 Technical Origins and Motivations 

Five features need elucidation to understand the technical
origins and motivations of Pilot-Job systems: task-level
distribution and parallelism, master-worker pattern,
multi-tenancy, multi-level scheduling, and resource placeholding.
Pilot-Job systems coherently integrate resource placeholders,
multi-level scheduling, and coordination patterns to enable
task-level distribution and parallelism on multi-tenant
resources. The analysis of each feature clarifies how Pilot-Job
systems support the execution of workloads comprised of
multiple tasks on one or more distributed machines.

Task-level distribution and parallelism on multiple
resources can be traced back to 1922 as a way to reduce the
time to solution of differential equations [24]. In his Weather
Forecast Factory [25], Lewis Fry Richardson imagined
distributing computing tasks across 64,000 “human computers”
to be processed in parallel. Richardson’s goal was to exploit
the parallelism of multiple processors to reduce the time
needed for the computation. Today, task-level parallelism is
commonly adopted in weather forecasting on modern
high-performance machines¹ as computers. Task-level parallelism
is also pervasive in computational science [26] (see Ref. [27]
and references therein).

Master-worker is a coordination pattern commonly used
for distributed computations [28-32]. Submitting tasks to
multiple computers at the same time requires coordinating
the process of sending and receiving tasks; of executing them;
and of retrieving and aggregating their outputs [33]. In
the master-worker pattern, a “master” has a global view
of the overall computation and of its progress towards a
solution. The master distributes tasks to multiple “workers”,
and retrieves and aggregates the results of each worker’s
computation. Alternative coordination patterns have been
devised, depending on the characteristics of the computed
tasks but also on how the system implementing task-level
distribution and parallelism has been designed [34].
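
The master-worker coordination described above can be sketched as follows. This is a minimal illustration, assuming Python threads stand in for distributed workers and squaring an integer stands in for a task; the names `master` and `worker` are hypothetical and are not taken from any Pilot-Job system.

```python
import queue
import threading


def worker(tasks, results):
    """Worker: pull tasks until a None sentinel arrives, push results back."""
    while True:
        task = tasks.get()
        if task is None:  # sentinel: no more work for this worker
            break
        results.put((task, task * task))


def master(inputs, n_workers=4):
    """Master: distribute tasks, then retrieve and aggregate the outputs."""
    tasks, results = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for item in inputs:
        tasks.put(item)
    for _ in workers:
        tasks.put(None)  # one sentinel per worker
    for w in workers:
        w.join()
    # Only the master has the global view needed to aggregate the results.
    outputs = {}
    while not results.empty():
        key, value = results.get()
        outputs[key] = value
    return sum(outputs.values()), outputs


total, per_task = master(range(10))
```

Workers never communicate with each other in this sketch; all coordination passes through the two queues owned by the master.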


Multi-tenancy defines how high-performance machines
are exposed to their users. Job schedulers, often called “batch
queuing systems” [35] and first used in the time of punched
cards [36,37], adopt the batch processing concept to promote
efficient and fair resource sharing. Job schedulers enable
users to submit computational tasks called “jobs” to a queue.
The execution of these jobs is delayed until the required
amount of the machine’s resources is available. The extent
of the delay depends on the number, size, and duration of the
submitted jobs, resource availability, and policies (e.g., fair
usage).

The resource provisioning of high-performance machines
is limited, irregular, and largely unpredictable [38-41]. By
definition, the resources accessible and available at any given
time can be fewer than those demanded by all the active
users. The resource usage patterns are also not stable over
time, and alternating phases of resource availability and
starvation are common [42,43]. This landscape has promoted
continuous optimization of the resource management and
the development of alternative strategies to expose and serve
resources to the users.

Multi-level scheduling is one of the strategies used to
improve resource access across high-performance machines.
In multi-level scheduling, a global scheduling decision results
from a set of local scheduling decisions [44,45]. For example,
an application submits tasks to a scheduler that schedules
those tasks on the schedulers of individual high-performance
machines. While this approach can increase the scale of
applications, it also introduces complexities across resources,
middleware, and applications.
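
As an illustration of a global decision resulting from local ones, the sketch below uses two hypothetical classes, `GlobalScheduler` and `LocalScheduler`. The load-based machine selection and the FIFO placement on cores are assumptions chosen for brevity, not policies described in this paper.

```python
from dataclasses import dataclass, field


@dataclass
class LocalScheduler:
    """Hypothetical per-machine scheduler: FIFO onto a fixed number of cores."""
    name: str
    cores: int
    queue: list = field(default_factory=list)

    def submit(self, task):
        self.queue.append(task)

    def schedule(self):
        # Local decision: map each queued task to (machine, core, wave).
        return {task: (self.name, i % self.cores, i // self.cores)
                for i, task in enumerate(self.queue)}


class GlobalScheduler:
    """Hypothetical application-facing scheduler: picks a machine per task."""

    def __init__(self, machines):
        self.machines = machines

    def submit(self, task):
        # Global decision: choose the machine with the shortest queue
        # relative to its core count; ties go to the first machine listed.
        target = min(self.machines, key=lambda m: len(m.queue) / m.cores)
        target.submit(task)

    def plan(self):
        # The overall plan is the union of the local scheduling decisions.
        plan = {}
        for machine in self.machines:
            plan.update(machine.schedule())
        return plan


machines = [LocalScheduler("cluster_a", cores=2),
            LocalScheduler("cluster_b", cores=4)]
scheduler = GlobalScheduler(machines)
for t in range(6):
    scheduler.submit(f"task_{t}")
plan = scheduler.plan()
```

The application only talks to `GlobalScheduler`, yet where and when each task runs is ultimately fixed by each machine's own scheduler, which is where the added complexity comes from.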

Several approaches have been devised to manage these
complexities [46-54], but one of the persistent issues is the increase
of the implementation burden imposed on applications. For
example, in spite of progress made by grid computing [55,56]
to transparently integrate diverse resources, most of the
requirements involving the coordination of task execution still
reside with the applications [57-59]. This translates into
single-point solutions, extensive redesign and redevelopment
of existing applications when adapted to new use cases or
new high-performance machines, and lack of portability and
interoperability.

¹ A high-performance machine indicates a cluster of computers
delivering higher performance than single workstations or
desktop computers, or a resource with adequate performance
to support multiple science and engineering applications
concurrently.

Resource placeholders are used as a pragmatic solution
to better manage the complexity of executing applications.
A resource placeholder decouples the acquisition of compute
resources from their use to execute the tasks of an application.
For example, resources are acquired by scheduling a job onto a
high-performance machine; when executed, that job is capable
of retrieving and executing application tasks itself.

Resource placeholders bring together multi-level scheduling
and task-level distribution and parallelism. Placeholders are
scheduled on one or more machines and then multiple tasks
are scheduled at the same time on those placeholders. Tasks
can then be executed concurrently and in parallel when the
placeholders cover multiple compute resources. The master-
worker pattern is often an effective choice to manage the
coordination of task execution.

It should be noted that resource placeholders also mitigate 
the side-effects of multi-tenancy. A placeholder still spends 
a variable amount of time waiting to be executed on a high- 
performance machine, but, once executed, the application 
exerts total control over the placeholder resources. In this 
way, tasks are directly scheduled on the placeholder without 
competing with other users for the same resources. 

Resource placeholders are programs with specific queuing 
and scheduling capabilities. They rely on jobs submitted 
to a high-performance machine to execute a program with 
diverse capabilities. For example, jobs usually execute
non-interactive programs, but users can submit jobs that execute
terminals, debuggers, or other interactive software. 
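
The decoupling of resource acquisition from task execution can be illustrated with a small simulation. The `Placeholder` class and its methods are hypothetical names, and the batch queue is reduced to an `active` flag; the point is only that tasks are bound to resources after a placeholder has started.

```python
from collections import deque


class Placeholder:
    """Hypothetical placeholder: a job that, once running, pulls tasks."""

    def __init__(self, name, cores):
        self.name = name
        self.cores = cores
        self.active = False

    def start(self):
        # Stands in for the machine's scheduler starting the queued job.
        self.active = True

    def run(self, task_queue):
        # Late binding: tasks are selected only now, when cores are held.
        done = []
        for _ in range(self.cores):
            if not task_queue:
                break
            task = task_queue.popleft()
            done.append((self.name, task()))
        return done


# Five application tasks, each a callable returning a value.
tasks = deque((lambda i=i: i * 2) for i in range(5))
placeholders = [Placeholder("p0", cores=2), Placeholder("p1", cores=4)]

# Assume only p1 has cleared the batch queue; p0 is still waiting.
placeholders[1].start()

results = []
for p in placeholders:
    if p.active:
        results.extend(p.run(tasks))
```

No task was ever assigned to `p0`, so its queue wait costs the application nothing; the remaining task stays in the application-side queue until some placeholder becomes active.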


2.2 Chronological Evolution 

Figure 1 shows the introduction of Pilot-Job systems over
time alongside some of the defining milestones of their
evolution.² This is an approximated chronology based on the
date of the first publication, or when publications are not
available, on the date of the systems’ code repository.

The evolution of Pilot-Job systems began with the
implementation of resource placeholders to explore application-side
task scheduling and high-throughput task execution.
Prototypes of Pilot-Job systems followed, eventually evolving
into production-grade systems supporting specific types of
applications and high-performance machines. Recently, Pilot
systems have been employed to support a wide range of
workloads and applications (e.g., MPI, data-driven workflows,
tightly and loosely coupled ensembles), and more diverse
high-performance machines.

AppLeS (Application Level Schedulers) [61] offered an early
implementation of resource placeholders. Developed around
1997, AppLeS provided an agent that could be embedded
into an application to acquire resources and to schedule tasks
onto them. AppLeS provided application-level scheduling
but did not isolate the application from resource acquisition.
Any change in the agent directly translated into a change of
the application code. AppLeS Templates [62] was developed
to address this issue, each template representing a class
of applications (e.g., parameter sweep [63]) that could be
adapted to the requirements of a specific realization.

² To the best of the authors’ knowledge, the term “pilot”
was first coined in 2004 in the context of the WLCG Data
Challenge [8,9], and then introduced in writing as “pilot-agent”
in a 2005 LHCb report [60].

Figure 1: Introduction of Pilot-Job systems over
time alongside some exemplar milestones of their
evolution. When available, the date of first mention
in a publication or otherwise the release date of
software implementation is used. (Milestone labels in the
figure: Resource Placeholders, Grid Integration, LHC Adoption,
MPI Capabilities, Workfl. Sys. HPC/Cloud.)


Volunteer computing projects started around the same
time as AppLeS was introduced. In 1997, the Great Internet
Mersenne Prime Search effort [64], shortly followed by
distributed.net [65], competed in the RC5-56 secret-key
challenge [66]. In 1999, the SETI@Home project [67] was released
to the public to analyze radio telescope data. The Berkeley
Open Infrastructure for Network Computing (BOINC)
framework [68] grew out of SETI@Home in 2002 [69], becoming
the de facto standard framework for volunteer computing.

Volunteer computing implements a client-server architecture
to achieve high-throughput task execution. Users install
a client on their own workstation and then the client pulls
tasks from the server when CPU cycles are available. Each
client behaves as a sort of resource placeholder, one of the
core features of a Pilot-Job system as seen in §2.1.

HTCondor (formerly known as Condor) is a distributed
computing framework [70] with a resource model similar to
that of volunteer computing. Developed around 1988, Condor
enabled users to execute tasks on a resource pool made
of departmental Unix workstations. In 1996, Flocking [71]
implemented task scheduling over multiple Condor resource
pools and, in 2002, “Glidein” [72] added grid resources to
Condor pools via resource placeholders.

Several Pilot-Job systems were developed alongside Glidein
to benefit from the high-throughput and scale promised by
grid resources. Around 2000, Nimrod/G [17] extended the
parameterization engine of Nimrod [73] with resource
placeholders. Four years later, the WISDOM (wide in silico docking
on malaria) [74] project developed a workload manager that
used resource placeholders on the EGEE (Enabling Grids
for E-Science in Europe) grid [75] to compute the docking of
multiple compounds, i.e., the molecules.

The success of grid-based Pilot-Job systems and especially
of Glidein reinforced the relevance of resource placeholders to
enable scientific computation, but their implementation also
highlighted two main challenges: user/system layer isolation,
and application development model. For example, Glidein
allowed users to manage resource placeholders directly,
but machine administrators had to manage the software
required to create the resource pools. Application-wise, Glidein
enabled integration with application frameworks but did not
programmatically support the development of applications
by means of dedicated APIs and libraries.

Concomitant and correlated with the development of the
LHC [76], there was a “Cambrian Explosion” of Pilot-Job
systems. Approximately between 2001 and 2006, five
major Pilot systems were developed: Distributed ANalysis
Environment (DIANE) [77,78], ALIce ENvironment
(AliEn) [79,80], Distributed Infrastructure with Remote
Agent Control (DIRAC) [81,82], Production and Distributed
Analysis (PanDA) [83], and Glidein Workload Management
System (GlideinWMS) [84,85]. These Pilot-Job systems were
developed to serve user communities and experiments at the
LHC: DIRAC is being developed and maintained by the
LHCb experiment [86]; AliEn by ALICE [87]; PanDA by ATLAS;
and GlideinWMS by the US national group [88] of the
CMS experiment [89].

The LHC Pilot-Job systems have been designed to be
functionally very similar, work on almost the same underlying
infrastructure, and serve applications with very similar
characteristics. Around 2011, these similarities enabled
Co-Pilot [90,91] to support the execution of resource placeholders
on cloud and volunteer computing [92] resource pools for all
the LHC experiments.

Pilot-Job systems development continued to support
research, resources, middleware, and frameworks independent
from the LHC experiments. ToPoS (Token Pool Server) [93]
was developed around 2009 by SARA (Stichting Academisch
Rekencentrum Amsterdam) [94]. ToPoS mapped tasks to
tokens and distributed tokens to resource placeholders. A
REST interface was used to store task definitions, avoiding
the complexities of the middleware of high-performance
machines [95].

Developed around 2011, BigJob [96] (now re-implemented
as RADICAL-Pilot [23]) supported task-level parallelism
on HPC machines. BigJob extended pilots to also hold
data resources, exploring the notion of “pilot-data” [97], and
used an interoperability library called “SAGA” (Simple API
for Grid Applications) to work on a variety of computing
infrastructures [96,98]. BigJob also offered application-
level programmability of distributed applications and their
execution.

GWPilot [100] built upon the GridWay meta-scheduler [101]
to implement efficient and reliable scheduling algorithms.
Developed around 2012, GWPilot was specifically aimed at grid
resources and enabled customization of scheduling at the
application level, independent from the resource placeholder
implementation.

Pilot-Job systems have also been used to support science
workflows. For example, Corral [102] was developed as a
frontend to Glidein and to optimize the placement of glideins
(i.e., resource placeholders) for the Pegasus workflow
system [103]. Corral was later extended to also serve as one of
the frontends of GlideinWMS. BOSCO [104], also a workflow
management system, was developed to offer a unified job
submission interface to diverse middleware, including the Glidein
and GlideinWMS Pilot-Job systems. The Coaster [12,105]
and Falkon [18] Pilot-Job systems were both tailored to
support the execution of workflows specified in the Swift
language [106].

Figure 2: Diagrammatic representation of the logical
components and functionalities of Pilot systems.
The logical components are highlighted in green, and
the functionalities in blue.



3. THE PILOT ABSTRACTION 

The overview presented in §2 shows a degree of heterogeneity
among Pilot-Job systems. These systems are implemented
to support specific use cases by executing certain types of
workload on machines with particular middleware.
Implementation details hide the commonalities and differences
among Pilot-Job systems. Consequently, in this section we
describe the components, functionalities, and architecture
pattern shared by Pilot-Job systems. Together, these elements
comprise what we call the “pilot abstraction”.

Pilot-Job systems are developed by independent projects 
and described with inconsistent terminologies. Often, the 
same term refers to multiple concepts or the same concept is 
named in different ways. We address this source of confusion 
by defining a terminology that can be used consistently across 
Pilot-Job systems, including the workloads they execute and 
the resources they use. 

3.1 Logical Components and Functionalities

Pilot-Job systems employ three separate but coordinated
logical components: a Pilot Manager, a Workload Manager,
and a Task Manager (Figure 2). The Pilot Manager
handles the provisioning of one or more resource placeholders
(i.e., pilots) on single or multiple machines. The Workload
Manager handles the dispatching of one or more workloads
on the available resource placeholders. The Task Manager
handles the execution of the tasks of each workload on the
resource placeholders.

The implementation of these three logical components varies
across Pilot-Job systems (see §4). For example, two or more
logical components may be implemented by a single software
element or additional functionalities may be integrated into
the three management components.

The three logical components support the common
functionalities of Pilot-Job systems: Pilot Provisioning, Task
Dispatching, and Task Execution (Figure 2). Pilot-Job
systems have to provision resource placeholders on the target
machines, dispatch tasks on the available placeholders,
and use these placeholders to execute the tasks of the given
workload. More functionalities may be needed to implement
a production-grade Pilot-Job system such as, for example,
authentication, authorization, accounting, data management,
fault-tolerance, or load-balancing. However, these
functionalities depend on the type of use cases, workloads, or resources
and, as such, are not necessary to every Pilot-Job system.

As seen in §2.1, resource placeholders enable tasks to
utilize resources without directly depending on the capabilities
exposed by the target machines. Resource placeholders
are scheduled onto target machines by means of dedicated
capabilities, but once scheduled and then executed, these
placeholders make their resources directly available for the
execution of the tasks of a workload.

The provisioning of resource placeholders depends on the
capabilities exposed by the middleware of the targeted
machine and on the implementation of each Pilot-Job system.
Provisioning a placeholder on middleware with queues, batch
systems, and schedulers typically involves the placeholder
being submitted as a job. For such middleware, a job is a type
of logical container that includes configuration and execution
parameters alongside information on the application to be
executed on the machine’s compute resources. Conversely, for
machines without a job-based middleware, a resource
placeholder might be executed by means of other types of logical
container such as, for example, a virtual machine [107,108].

Once placeholders control a portion of a machine’s resources,
tasks need to be dispatched to those placeholders for
execution. Task dispatching is controlled by the Pilot-Job system,
not by the targeted machine’s middleware. This is a defining
characteristic of Pilot-Job systems because it decouples
the execution of a workload from the need to submit its
tasks via the machine’s scheduler. Execution patterns
involving task and/or data dependences can thus be implemented
independent of the constraints of the target machine’s
middleware. Ultimately, this is how Pilot-Job systems can improve
workload execution compared to direct submission.
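
The division of labor among the three logical components can be sketched in a few lines. This is an illustrative toy under stated assumptions: the class and method names are hypothetical, provisioning is reduced to creating an object rather than submitting a job through middleware, and round-robin dispatching is an arbitrary policy.

```python
from dataclasses import dataclass, field


@dataclass
class Pilot:
    name: str
    cores: int
    assigned: list = field(default_factory=list)


class PilotManager:
    """Pilot Provisioning. A real system would submit the pilot as a job
    (or another container) via the target machine's middleware."""

    def provision(self, machine, cores):
        return Pilot(name=f"pilot@{machine}", cores=cores)


class WorkloadManager:
    """Task Dispatching, decided by the Pilot-Job system itself and not
    by the machine's scheduler."""

    def dispatch(self, workload, pilots):
        for i, task in enumerate(workload):
            pilots[i % len(pilots)].assigned.append(task)


class TaskManager:
    """Task Execution on the resources held by one pilot."""

    def execute(self, pilot):
        return [(pilot.name, task()) for task in pilot.assigned]


# A workload of four independent tasks, each a callable.
workload = [lambda i=i: i + 100 for i in range(4)]

pm, wm, tm = PilotManager(), WorkloadManager(), TaskManager()
pilots = [pm.provision("machine_a", 2), pm.provision("machine_b", 2)]
wm.dispatch(workload, pilots)
results = [r for p in pilots for r in tm.execute(p)]
```

Only `PilotManager.provision` would need to know about a machine's middleware in a fuller implementation; dispatching and execution operate purely on pilots.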

The three logical components of a Pilot-Job system (Workload
Manager, Pilot Manager, and Task Manager) need to
communicate and coordinate in order to execute the given
workload. Any suitable communication and coordination
pattern [109,110] can be used and this pattern may be
implemented by any suitable technology. In a distributed context,
different network architectures and protocols may also be
used to achieve effective communication and coordination.

As seen in §2.1, master-worker is a common coordination
pattern among Pilot-Job systems. Workload and Task Managers
are implemented as separate modules, one acting as master
and the other as worker. The master dispatches tasks while
the workers execute them independent of each other.
Alternative coordination patterns can be used where, for example,
Workload and Task Managers are implemented as a single
module sharing dispatching and execution responsibilities.

Data management can play an important role within a
Pilot-Job system as most workloads require reading input
and writing output data. The mechanisms used to make
input data available and to store and share output data
depend on use cases, workloads, and resources. Accordingly,
data capabilities other than reading and writing files (for
example, data replication, concurrent data transfers,
non-file-based data abstractions, or data placeholders) should be
considered special-purpose capabilities, not characteristic of
every Pilot-Job system.


3.2 Terms and Definitions 

In this subsection, we define a minimal set of terms related 
to the logical components and capabilities of Pilot-Job systems.
The terms “pilot” and “job” need to be understood in
the context of machines and middleware used by Pilot-Job 
systems. These machines offer compute, storage, and net¬ 
work resources and pilots allow for the utilization of those 
resources to execute the tasks of one or more workloads. 

Task. A set of operations to be performed on a computing 
platform, alongside a description of the properties and 
dependences of those operations, and indications on how 
they should be executed and satisfied. Implementations 
of a task may include wrappers, scripts, or applications. 

Workload. A set of tasks, possibly related by a set of
arbitrarily complex relations. For example, relations may
involve tasks, data, or runtime communication
requirements.

The tasks of a workload can be homogeneous, heterogeneous,
or one-of-a-kind. An established taxonomy for workload
description is not available. We propose a taxonomy
based upon the orthogonal properties of coupling,
dependency, and similarity of tasks.

Workloads comprised of tasks that are independent and
indistinguishable from each other are commonly referred to
as a Bag-of-Tasks (BoT) [111,112]. Ensembles are workloads
where the collective outcome of the tasks is relevant (e.g.,
computing an average property) [113]. The tasks that
comprise the workload in turn can have varying degrees and types
of coupling; coupled tasks might have global (synchronous)
or local (asynchronous) exchanges, and regular or irregular
communication. We categorize such workloads as coupled
ensembles independent of the specific details of the coupling
between the tasks. A workflow represents a workload with
arbitrarily complex relationships among the tasks, ranging from
dependencies (e.g., sequential or data) to coupling between
the tasks (e.g., frequency or volume of exchange) [52].
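
One way to read the taxonomy above is as a small classifier over the three properties of coupling, dependency, and similarity. The sketch below is an interpretation rather than a definition from this paper: the `Task` fields, the `classify` function, and the order in which the rules are checked are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    kind: str                                        # proxy for similarity
    depends_on: set = field(default_factory=set)     # dependency relations
    coupled_with: set = field(default_factory=set)   # runtime exchanges


def classify(tasks, collective_outcome=False):
    """Illustrative mapping from task properties to workload categories."""
    if any(t.depends_on for t in tasks):
        return "workflow"
    if any(t.coupled_with for t in tasks):
        return "coupled ensemble"
    if collective_outcome:
        return "ensemble"
    if len({t.kind for t in tasks}) == 1:
        return "bag-of-tasks"
    return "heterogeneous independent tasks"


bot = [Task("a", "sim"), Task("b", "sim")]
coupled = [Task("a", "sim", coupled_with={"b"}),
           Task("b", "sim", coupled_with={"a"})]
flow = [Task("a", "sim"), Task("b", "post", depends_on={"a"})]

labels = [classify(bot),
          classify(bot, collective_outcome=True),
          classify(coupled),
          classify(flow)]
```

Checking dependencies first reflects the description of a workflow as the most general case; a different ordering would be equally defensible.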

Resource. A description of a finite, typed, and physical
entity utilized when executing the tasks of a workload.
Compute cores, data storage space, or network
bandwidth between a source and a destination are examples
of resources commonly utilized when executing
workloads.

Distributed Computing Resource (DCR). A system
characterized by: a set of possibly heterogeneous
resources, a middleware, and an administrative domain.
A cluster is an example of a DCR: it offers sets of
compute, data, and network resources; it deploys a
middleware such as, for example, the Torque batch system,
the Globus grid middleware, or the OpenStack cloud
platform; and it enforces policies of an administrative
domain like XSEDE, OSG, CERN, NERSC, or a
University. So-called supercomputers or workstations can
be other examples of DCR, where the term “distributed”
refers to (correlated) sets of independent types of
resources.

Distributed Computing Infrastructure (DCI). A set
of DCRs federated with a common administrative,
project, or policy domain, also shared at the software
level. The federation and thus the resulting DCI can be
dynamic, for example, a DCR that is part of XSEDE
can be federated with a DCR that is part of OSG without
having to integrate entirely the two administrative
domains.


Our definitions of resource and DCR might seem restrictive 
or inconsistent with how the term “resource” is sometimes 
used in the field of distributed computing. This is because 
the terms “DCR” and “resource” as defined here refer to the 
types of machine and to the types of computing resource they 
expose to the user. In its common use, the term “resource” 
conflates these two elements because it is used to indicate 
specific machines like, for example, Stampede, but also a 
specific computing resource as, for example, compute cores. 

The term “DCR” also offers a more precise definition of the 
generic term “machine”. DCR indicates a type of machine 
in terms of its resources, middleware, and administrative 
domain. These three elements are required to characterize 
Pilot-Job systems as they determine the type of resources that 
can be held by a pilot, the pilot properties and capabilities, 
and the administrative constraints on its instantiation. 

The use of the term “distributed” in DCR makes explicit 
that the aggregation of diverse types of resources may happen 
at a physical or logical level, and at an arbitrary scale. This 
is relevant because the set of resources of a DCR can belong 
to a physical or virtual machine as much as to a set of 
these entities [114-116], either co-located on a single site or
distributed across multiple sites. Both a physical cluster of 
compute nodes and a logical cluster of virtual machines are 
DCRs as they have a set of resources, a middleware, and an 
administrative domain. 

The term “DCI”, commonly used to indicate a distributed 
computing infrastructure, is consistent with both “resource” 
and “DCR” as defined here. Diverse types of resource are 
collected into one or more DCR, and aggregates of DCRs 
that share some common administrative aspects or policy 
form a DCI. 

As seen in 0 most of the DCRs used by Pilot-Job systems 
utilize “queues”, “batch systems”, and “schedulers”. In these 
DCRs, jobs are scheduled and then executed by a batch 
system. 


Job. A type of container used to acquire resources on a 
DCR. 


When considering Pilot-Job systems, jobs and tasks are
functionally analogous but qualitatively different. Functionally,
both jobs and tasks are containers, i.e., metadata wrappers
around one or more executables often called “application”
or “script”. Qualitatively, tasks are the functional units of a
workload, while jobs are what is scheduled on a DCR. Given
their functional equivalence, the two terms can be adopted
interchangeably when considered outside the context of
Pilot-Job systems.

As described in §3.1, a resource placeholder needs to be
submitted to a DCR in order to acquire resources for the
Pilot-Job. The placeholder needs to be wrapped in a container,
e.g., a job, and that container needs to be supported by 
the middleware of the target DCR. For this reason, the 
capabilities exposed by the middleware of the target DCR 
determine the submission process of resource placeholders 
and its specifics. 

Pilot. A container (e.g., a “job”) that functions as a resource
placeholder on a given infrastructure and is capable of
executing tasks of a workload on that resource.

A pilot is a resource placeholder that holds a portion of a
DCR’s resources. A Pilot-Job system is software capable of 
creating pilots so as to gain exclusive control over a set of 
resources on one or more DCRs, and then to execute the 
tasks of one or more workloads on those pilots. 

The term “pilot” as defined here is named differently across
Pilot-Job systems. In addition to the term “placeholder”, pilots
have also been named “job agent”, “job proxy”, “coaster”,
and “glidein”. These terms are used as synonyms,
often without distinguishing between the type of
container and the type of executable that compose a pilot.

Until now, the term “Pilot-Job system” has been used 
to indicate those systems capable of executing workloads 
on pilots. For the remainder of this paper, the term “Pilot 
system” will be used instead, as the term “job” in “Pilot-Job” 
identifies just the way in which a pilot is provisioned on a 
DCR exposing specific middleware. The use of the term 
“Pilot-Job system” should be regarded as a historical artifact, 
indicating the use of middleware in which the term “job” was, 
and still is, meaningful. 

We have now defined resources, DCRs, and pilots. We have 
established that a pilot is a placeholder for a set of a DCR’s
resources. When combined, the resources of multiple pilots 
form a resource overlay. The pilots of a resource overlay can 
potentially be distributed over distinct DCRs. 

Resource Overlay. The aggregated set of resources of multiple
pilots possibly instantiated on multiple DCRs.

As seen in §2.1, three more terms associated with Pilot
systems need to be explicitly defined: “early binding”, “late 
binding”, and “multi-level scheduling”. 

The terms “binding” and “scheduling” are often used interchangeably
but here we use “binding” to indicate the association
of a task to a pilot and “scheduling” to indicate the
enactment of that association. Binding and scheduling may
happen at distinct points in time and this helps to expose 
the difference between early and late binding, and multi-level 
scheduling. 

The type of binding of tasks to pilots depends on the state
of the pilot. A pilot is inactive until it is executed on a
DCR, and active thereafter until it completes or fails. Early
binding indicates the binding of a task to an inactive pilot;
late binding indicates the binding of a task to an active one.
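
The dependence of the binding type on the pilot state can be made concrete with a minimal Python sketch. The names `PilotState` and `binding_type` are illustrative assumptions and are not taken from any Pilot system; the only logic encoded is the inactive/active distinction given above.

```python
from enum import Enum


class PilotState(Enum):
    """Pilot states as described in the text (names are hypothetical)."""
    INACTIVE = "inactive"  # not yet executing on a DCR
    ACTIVE = "active"      # executing on a DCR
    DONE = "done"          # completed or failed


def binding_type(pilot_state):
    """Classify the binding of a task to a pilot in the given state."""
    if pilot_state is PilotState.INACTIVE:
        return "early"
    if pilot_state is PilotState.ACTIVE:
        return "late"
    # A pilot that has completed or failed can no longer receive tasks.
    raise ValueError("cannot bind a task to a completed or failed pilot")
```

In this sketch the classification is a pure function of the pilot state at the moment of binding, which is why the same task may be early bound in one run and late bound in another.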

Early binding is useful because by knowing in advance the 
properties of the tasks that are bound to a pilot, specific 
deployment decisions can be made for that pilot. For example, 
a pilot can be scheduled onto a specific DCR, because of 
the capabilities of the DCR or because the data required by 
the tasks are already available on that DCR. Late binding 
is instead critical to assure high throughput by enabling 
sustained task execution without additional queuing time or 
pilot instantiation time. 

Once tasks have been bound to pilots, Pilot systems are
said to implement multi-level scheduling [5, 16, 54] because
they include scheduling onto the DCR as well as scheduling
onto the pilots. Unfortunately, the term “level” in multi-level
is left unspecified, making it unclear what is scheduled and when.
Assuming the term “entity” indicates what is scheduled, and
the term “stage” the point in time at which the scheduling
happens, “multi-entity” and “multi-stage” are better terms to
describe the scheduling properties of Pilot systems.
“Multi-entity” indicates that (at least) two entities are scheduled
and “multi-stage” that such scheduling happens at separate
moments in time. Pilot systems schedule pilots on DCRs and
tasks on pilots at different points in time.

Figure 3: Diagrammatic representation of the logical
components, functionalities, and core terminology of
a Pilot system. The core terminology is highlighted
in red, the logical components of a Pilot system in
green, and the functionalities in blue. The components
of a Pilot system are represented by boxes
with a thicker border.

Early binding. Binding one or more tasks to an inactive 
pilot. 

Late binding. Binding one or more tasks to an active pilot. 

Multi-entity and Multi-stage scheduling. Scheduling pilots
onto resources, and scheduling tasks onto (active
or inactive) pilots.
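
As an illustration of the last definition, the following Python sketch records which entity is scheduled at which stage. The function name, the log format, and the round-robin placement of tasks are assumptions made for the example; the definition itself does not prescribe any placement policy.

```python
def multi_stage_schedule(pilots, tasks):
    """Return a log of (stage, entity, name, target) scheduling events.

    pilots: dict mapping a pilot name to the DCR it is scheduled on.
    tasks:  list of task names.
    """
    if tasks and not pilots:
        raise ValueError("tasks need at least one pilot")
    log = []
    # Stage 1: the scheduled entity is the pilot, the target is a DCR.
    for pilot, dcr in pilots.items():
        log.append((1, "pilot", pilot, dcr))
    # Stage 2: the scheduled entity is the task, the target is a pilot.
    # Round-robin is an arbitrary placement choice for this sketch.
    names = list(pilots)
    for i, task in enumerate(tasks):
        log.append((2, "task", task, names[i % len(names)]))
    return log
```

Two distinct entities (pilots and tasks) appear in the log, and they are scheduled at two distinct stages, which is what the terms “multi-entity” and “multi-stage” are meant to capture.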

Figure 3 offers a diagrammatic overview of the logical
components of Pilot systems (green) alongside their functionalities
(blue) and the defined terminology (red). The figure
is composed of three main blocks: the one in the top-left corner
represents the workload originator; the one starting at
the top-right and shaded in gray represents the Pilot system;
and the four boxes, one inside the other, on the left side of
the figure represent a DCR. Of the four boxes, the outermost
denotes the DCR boundaries, e.g., a cluster. The second box
represents the container used to schedule a pilot on the DCR,
e.g., a job or a virtual machine. The third box represents the pilot
once it has been instantiated on the DCR, and the fourth
box represents the resources held by the pilot. The boxes
representing the components of a Pilot system have been
highlighted with a thicker border.

Figure 3 shows the separation between the DCR and the
Pilot system, and how the resources on which tasks are executed
are contained in the DCR within different logical and
physical components. Appreciating the characteristics and
functionalities of a Pilot system depends upon understanding
the levels at which each of its components exposes capabilities.
An application submits one or more workloads composed of
tasks to the Pilot system via an interface (tag a). The Pilot
Manager is responsible for provisioning pilots (tag b), the
Workload Manager for dispatching tasks to the Task Manager
(tag c), and the Task Manager for executing those tasks once the
pilot has become available (tag d).
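
The division of responsibilities among the three components can be sketched in a few lines of Python. This is a toy, in-process illustration under stated assumptions: the class and function names mirror the logical components but are hypothetical, tasks are plain Python callables, and no real DCR middleware is contacted.

```python
class TaskManager:
    """Runs on the pilot and executes tasks (tag d)."""

    def __init__(self, cores):
        self.cores = cores
        self.results = []

    def execute(self, task):
        self.results.append(task())


class PilotManager:
    """Provisions pilots on a DCR (tag b)."""

    def provision(self, dcr, cores):
        # A real system would submit a container (job, VM) to the DCR
        # middleware here; this sketch returns the pilot immediately.
        return {"dcr": dcr, "task_manager": TaskManager(cores)}


class WorkloadManager:
    """Dispatches tasks to the Task Manager of a pilot (tag c)."""

    def dispatch(self, workload, pilot):
        for task in workload:
            pilot["task_manager"].execute(task)


def run(workload, dcr="cluster-A", cores=4):
    """Interface through which an application submits a workload (tag a)."""
    pilot = PilotManager().provision(dcr, cores)
    WorkloadManager().dispatch(workload, pilot)
    return pilot["task_manager"].results
```

For example, `run([lambda: 1 + 1, lambda: "done"])` provisions one pilot and executes both tasks on it. The sketch deliberately omits queuing delays, failures, and communication protocols, none of which are fixed by the logical components.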

Note how in Figure 3 scheduling happens at the DCR
(tag b), for example, by means of a cluster scheduler, and
then at the pilot (tag c). This illustrates what here has been
called “multi-entity” and “multi-stage” scheduling, replacing
the more common but less precise term multi-level scheduling.
The separation between scheduling at the pilot and
scheduling at the Workload Manager highlights the four entities
involved in the two-stage scheduling: jobs on DCR
middleware, and tasks on pilots. This helps to appreciate
the critical distinction between the container of a pilot and
the pilot itself. A container is used by the Pilot Manager to
provision the pilot. Once the pilot has been provisioned, it
is the pilot and not the container that is responsible for both
holding a set of resources and offering the functionalities of
the Task Manager.

Figure 3 should not be confused with an architectural diagram.
No indications are given about the interfaces that
should be used, how the logical components should be mapped
into software modules, or what type of communication and
coordination protocols should be adopted among such components.
This is why no distinction is made diagrammatically
between, for example, early and late binding.

Figure 3 is instead an architectural pattern [118] for systems
that execute workloads on multiple DCRs via pilot-based
multi-entity, many-stage scheduling of tasks. This pattern
can be realized into an architectural description and then
implemented into a specific Pilot system. Several architectural
models, frameworks, languages, supporting platforms,
and standards are available to produce architectural descriptions
[119, 120]. Common examples are the 4+1 architectural
view [121], Open Distributed Processing (ODP) [122],
Zachman [123], The Open Group Architecture Framework
(TOGAF) [124], and the Attribute-Driven Design (ADD) [125].

4. PILOT SYSTEMS 

In this section we examine multiple implementations of Pilot
systems. Initially, we derive core and auxiliary properties
of Pilot system implementations from the components and
functionalities described in §3.1. Subsequently, we describe
a selection of Pilot system implementations showing how the
architecture of each system maps to the architectural pattern
presented in §3.2. Finally, we offer insight about the commonalities
and differences among the described Pilot system
implementations, discussing also their most relevant auxiliary
properties.

4.1 Core properties 

Core properties are specific to Pilot systems and necessary
for their implementation. These properties characterize Pilot
systems because they relate to pilots and how they are used
to execute tasks. Without core properties, Pilot Managers,
Workload Managers, and Task Managers would not be capable
of providing pilots or of dispatching and executing tasks. We
list the core properties of Pilot systems in Table 1.

The first three core properties - Pilot Scheduling, Pilot
Bootstrapping, and Pilot Resources - relate to the procedures
used to provision pilots and to the resources they hold. Pilots
can be deployed by a Pilot Manager using a suitable wrapper
that can be scheduled on the targeted DCR middleware.
Pilots become available only with a correct bootstrapping
procedure, and they can be used for task execution only if
they acquire at least one type of resource, e.g., compute cores
or data storage.

The Workload Binding and Workload Scheduling core properties
relate to how Pilot systems bind tasks to pilots, and
then how these tasks are scheduled once pilots become available.
A Workload Manager can early or late bind tasks to
pilots depending on the DCR’s resources and the workload’s
requirements. Scheduling decisions may depend on the number
and capabilities of the available pilots or on the status of workload
execution. Workload Binding and Workload Scheduling
enable Pilot systems to control the coupling between task
requirements and pilot capabilities.

The Workload Environment core property relates to the
features and configuration of the environment provided by
the pilot in which tasks are executed on the DCR. A Task
Manager requires information about the environment to successfully
manage the execution of tasks. For example, the
Task Manager may have to make available supporting software
or choose suitable parameters for the task executable.

The following describes each core property. Note that these
properties refer to Pilot systems and not to individual pilots
instantiated on a DCR.

• Pilot Scheduling. Modalities for scheduling pilots 
on DCRs. Pilot scheduling may be: fully automated 
(i.e., implicit) or directly controlled by applications or 
users (i.e., explicit); performed on a single DCR (i.e., 
local) or coordinated across multiple DCRs (i.e., global); 
tailored to the execution of the workload (i.e., adaptive) 
or predefined on the basis of policies and heuristics (i.e.,
static).

• Pilot Bootstrapping. Modalities for pilot bootstrapping
on DCRs. Pilots can be bootstrapped from code
downloaded at every instantiation or from code that is
bundled by the DCR. The design of pilot bootstrapping
depends on the DCR environment and on whether
single or multiple types of DCRs are targeted. For
example, a design based on connectors can be used
with multiple DCRs to get information about container
type (e.g., job, virtual machine), scheduler type (e.g.,
PBS, HTCondor, Globus), number of cores, walltime,
or available filesystems.

• Pilot Resources. Types and characteristics of the resources
exposed by a Pilot system. Resource types are,
for example, compute, data, or networking, while some
of their typical characteristics are: size (e.g., number
of cores or storage capacity), lifespan, intercommunication
(e.g., low-latency or inter-domain), computing
platforms (e.g., x86 or GPU), and file systems (e.g., local
or distributed). The resource held by a pilot varies
depending on the system architecture of the DCR in
which the pilot is instantiated. For example, a pilot
may hold multiple compute nodes, single nodes, or
a portion of the cores of each node. The same applies
to file systems and their partitions or to physical and
software-defined networks.

• Workload Binding. Time of workload assignment to
pilots. Executing a workload requires its tasks to be
bound to one or more pilots before or after they are
instantiated on a DCR. As seen in §3, Pilot systems
may allow for two modalities of binding between tasks
and pilots: early binding and late binding. Pilot system
implementations differ in whether and how they support
these two types of binding.

• Workload Scheduling. Enactment of a binding. Pilot
systems can support (prioritized) application-level
or multi-stage scheduling decisions. Coupled tasks may
have to be scheduled on a single pilot, loosely coupled
or uncoupled tasks to multiple pilots; tasks may be
scheduled to a pilot and then to a specific pool of resources
on a single compute node; or task scheduling
may be prioritized depending on task size and duration.

• Workload Environment. Type, dependences, and
characteristics of the environment in which a workload’s
tasks are executed. Once scheduled to a pilot, a task
needs an environment that satisfies its execution requirements.
The execution environment depends on
the type of task (e.g., single or multi-threaded, MPI),
task code dependences (e.g., compilers, libraries, interpreters,
or modules), and task communication, coordination,
and data requirements (e.g., interprocess,
inter-node communication, data staging, sharing, and
replication).
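
One way to read the six core properties is as the fields of a record that characterizes a Pilot system implementation. The following Python sketch is only an illustration of that reading: the type names, the field encodings, and the values of `example` are assumptions and do not describe any specific Pilot system.

```python
from dataclasses import dataclass
from enum import Enum


class Control(Enum):
    IMPLICIT = "implicit"   # fully automated
    EXPLICIT = "explicit"   # controlled by applications or users


class Scope(Enum):
    LOCAL = "local"         # single DCR
    GLOBAL = "global"       # coordinated across multiple DCRs


class Policy(Enum):
    ADAPTIVE = "adaptive"   # tailored to the workload
    STATIC = "static"       # predefined policies and heuristics


class Binding(Enum):
    EARLY = "early"
    LATE = "late"


@dataclass(frozen=True)
class CoreProperties:
    pilot_scheduling: tuple          # (Control, Scope, Policy)
    pilot_bootstrapping: str
    pilot_resources: frozenset       # e.g., {"compute", "data"}
    workload_binding: frozenset      # subset of Binding
    workload_scheduling: str
    workload_environment: frozenset  # e.g., {"serial", "MPI"}

    def supports_binding(self, binding):
        return binding in self.workload_binding


# Hypothetical system, used only to show how the record is filled in.
example = CoreProperties(
    pilot_scheduling=(Control.EXPLICIT, Scope.GLOBAL, Policy.STATIC),
    pilot_bootstrapping="code downloaded at every instantiation",
    pilot_resources=frozenset({"compute"}),
    workload_binding=frozenset({Binding.LATE}),
    workload_scheduling="tasks pulled by active pilots",
    workload_environment=frozenset({"serial", "MPI"}),
)
```

Because the properties refer to Pilot systems and not to individual pilots, one such record would exist per implementation rather than per pilot instance.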

4.2 Auxiliary properties 

Auxiliary properties are not specific to Pilot systems and 
may be optional for their implementation. Pilot systems share 
auxiliary properties with other types of system and Pilot 
system implementations may have different subsets of these 
properties. For example, authentication and authorization 
are properties shared by many systems and Pilot systems may 
have to implement them only for some DCRs. Analogously, 
communication and coordination is not a core property of 
Pilot systems because, at some level, all software systems 
require communication and coordination. 

We list a representative subset of auxiliary properties for
Pilot systems in Table 2. The following describes these
auxiliary properties; as with the core properties, they
refer to Pilot systems and not to individual pilots instantiated
on a DCR.

• Architecture. Pilot systems may be implemented by 
means of different architectures, e.g., service-oriented, 
client-server, or peer-to-peer. Architectural choices 
may depend on multiple factors, including application 
use cases, deployment strategies, or interoperability 
requirements. 

• Communication and Coordination. As discussed
in §3.1, Pilot system implementations are not defined by
any specific communication and coordination protocol
or pattern. Communication and coordination among
the Pilot system components are determined by its
design, the chosen architecture, and the deployment
scenarios.

• Workload Semantics. Pilot systems may support
workloads with different compute and data requirements,
and inter-task dependences. Pilot systems may
assume that only workloads with a specific semantics
are given or may allow the user to specify, for example,
BoT, ensemble, or workflow.

Property              Description                                             Component         Functionality

Pilot Scheduling      Modalities for pilot scheduling on DCRs                 Pilot Manager     Pilot Provisioning
Pilot Bootstrapping   Modalities for pilot bootstrapping on DCRs              Pilot Manager     Pilot Provisioning
Pilot Resources       Types and characteristics of pilot resources            Pilot Manager     Pilot Provisioning

Workload Binding      Modalities and policies for binding tasks to pilots     Workload Manager  Task Dispatching
Workload Scheduling   Modalities and policies for scheduling tasks to pilots  Workload Manager  Task Dispatching

Workload Environment  Type and features of the task execution environment     Task Manager      Task Execution

Table 1: Mapping of the core properties of Pilot system implementations onto the components and functionalities
described in §3.1. Core properties are specific to Pilot systems and necessary for their implementation.


Property                        Description

Architecture                    Structures and components of the Pilot system
Coordination and Communication  Interaction protocols and patterns among the components of the system
Interface                       Interaction mechanisms both among components and exposed to the user
Interoperability                Qualitative and functional features shared among Pilot systems
Multitenancy                    Simultaneous use of the Pilot system components by multiple users
Resource Overlay                The aggregation of resources from multiple pilots into overlays
Robustness                      Resilience and reliability of pilot and workload executions
Security                        Authentication, authorization, and accounting framework
Files and Data                  Mechanisms for data staging and management
Performance                     Measure of the scalability, throughput, latency, or memory usage
Development Model               Practices and policies for code production and management
DCR Interaction                 Modalities and protocols for Pilot system/DCR interaction coordination

Table 2: Sample of Auxiliary Properties and their descriptions. Auxiliary properties are not specific to Pilot
systems and may be optional for their implementation.


• Interface. Pilot systems may implement several private
and public interfaces: among the components of
the Pilot system; among the Pilot system, the applications,
and the DCRs; or between the Pilot system
and the users via one or more application programming
interfaces.

• Interoperability. Pilot systems may implement at
least two types of interoperability: among Pilot system
implementations, and among DCRs with heterogeneous
middleware. For example, two Pilot systems may execute
tasks on each other’s pilots, or a Pilot system may
be able to provide pilots on LSF, Slurm, Torque, or
OpenStack middleware.

• Multitenancy. Pilot systems may offer multitenancy 
at both system and local level. When offered at system 
level, multiple users can utilize the same instance of 
a Pilot system; when available at local level, multiple 
users can share the same pilot. Executing multiple 
pilots on the same DCR indicates the multitenancy of 
the DCR, not of the Pilot system. 

• Robustness. Indicates the features of a Pilot system
that contribute to its resilience and reliability. Usually,
fault-tolerance, high-availability, and state persistence
are indicators of the maturity of the Pilot system
implementation and of its support for use cases.

• Security. The deployment and usability of Pilot systems
are influenced by security protocols and policies.
Authentication and authorization can be based on diverse
protocols and vary across Pilot systems.


• Data Management. As discussed in §3.1, only basic
data reading/writing functionalities are mandated by
a Pilot system. Nonetheless, most real-life use cases
require more advanced data management functionalities
that can be implemented within the Pilot system or
delegated to third-party tools.

• Performance and scalability. Pilot systems can be
optimized for one or more performance metrics, depending
on the target use cases. For example, Pilot systems
vary in terms of the overheads they add to the execution of
a given workload, the size and duration of the workloads a
user can expect to be supported, and the type and number
of supported DCRs and DCIs.

• Development Model. The model used to develop
Pilot systems may have an impact on the life span
of the Pilot system, its maintainability and, possibly,
its evolution path. This is especially relevant when
considering whether the development is supported by
an open community or by a single research project.

4.3 Implementations 

We analyze seven Pilot systems based on their availability,
design, intended use, and uptake. We describe systems
that: (i) implement diverse designs; (ii) target specific or
general-purpose use cases and DCRs; and (iii) are currently
available, actively maintained, and used by scientific communities.
Space constraints prevented consideration of additional
Pilot systems, and necessitated limiting the
analysis to the core properties of Pilot systems.

We compare Pilot systems using the architectural pattern
and common terminology defined in §3. Table 3 shows how
the components of the architectural pattern are named differently
across implementations. Table 4 offers instead a
summary of how the core properties are implemented for
each Pilot system we compared.³

Figure 4: Diagrammatic representation of the
Coaster System components, functionalities, and
core terminology mapped on Figure 3.


4.3.1 Coaster System 

The Coaster System (also referred to in literature as Coasters)
was developed by the Distributed Systems Laboratory at
the University of Chicago [129] and it is currently maintained
by the Swift project [130]. Initially developed within the CoG
project [131] and maintained in a separate, standalone repository,
today the Coaster System provides pilot functionalities
to Swift by means of an abstract task interface [132, 133].

The Coaster System is composed of three main components
[12]: a Coaster Client, a Coaster Service, and a set
of Workers. The Coaster Client implements both a Bootstrap
and a Messaging Service while the Coaster Service
implements a data proxy service and a set of job providers
for diverse DCR middleware. Workers are executed on the
DCR compute nodes to bind compute resources and execute
the tasks submitted by the users to the Coaster System.

Figure 4 illustrates how the Coaster System components
map to the components and functionalities of a Pilot system
as described in §3: the Coaster Client is a Workload
Manager, the Coaster Service a Pilot Manager, and each
Worker a Task Manager. The Coaster Service implements
the Pilot Provisioning functionality by submitting adequate
numbers of Workers on suitable DCRs. The Coaster Client
implements Task Dispatching while the Workers implement
Task Execution.

The execution model of the Coaster System can be summarized
in seven steps [105]: 1. a set of tasks is submitted
by a user via the Coaster Client API; 2. when not already
active, the Bootstrap Service and the Messaging Service are
started within the Coaster Client; 3. when not already active,
a Coaster Service is instantiated for the DCR(s) indicated in
the task descriptions; 4. the Coaster Service gets the task
descriptions and analyzes their requirements; 5. the Coaster
Service submits one or more Workers to the target DCR, taking
also into account whether any other Worker is already
active; 6. when a Worker becomes active it pulls a task and,
if any, its data dependences from the Coaster Client via the
Coaster Service; 7. the task is executed.
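
The seven steps can be mimicked by a small, in-process Python sketch. It is not the Coaster System API: `CoasterService`, `run_coaster_like`, and their parameters are hypothetical names, tasks are plain callables, and Workers are simulated by integer identifiers that take turns pulling tasks.

```python
from collections import deque


class CoasterService:
    """Stand-in for a Coaster Service bound to one DCR (hypothetical)."""

    def __init__(self, dcr, client_queue):
        self.dcr = dcr
        self.client_queue = client_queue
        self.workers = 0

    def ensure_workers(self, wanted):
        # Step 5: submit Workers only if not enough are already active.
        self.workers = max(self.workers, wanted)

    def pull(self):
        # Step 6: a Worker pulls a task from the client via the service.
        return self.client_queue.popleft() if self.client_queue else None


def run_coaster_like(tasks, dcr="dcr-A", workers=2):
    if workers < 1:
        raise ValueError("at least one Worker is required")
    queue = deque(tasks)                  # step 1: tasks are submitted
    service = CoasterService(dcr, queue)  # steps 2-3: services started on demand
    service.ensure_workers(min(workers, len(tasks)))  # steps 4-5
    results = []
    while queue:
        for worker_id in range(service.workers):
            task = service.pull()
            if task is None:
                break
            results.append((worker_id, task()))  # step 7: task executed
    return results
```

The point of the sketch is the direction of the interaction: tasks are not pushed to Workers, but pulled by them once they are active, with the service acting as an intermediary toward the client.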

Each Worker holds compute resources in the form of compute
cores. Data can be staged from a shared file system,
directly from the client to the Worker, or via the Coaster
Service acting as a proxy. Data are not a type of resource
held by the pilots and pilots are not used to expose data to
the user. Networking capabilities are assumed to be available
among the components of the Coaster System, but a
dedicated communication protocol is implemented and also
used for data staging as required.

The Coaster Service automates the deployment of pilots
(i.e., Workers) by taking into account several parameters:
total number of jobs that the DCR batch system accepts;
number of cores for each DCR compute node; DCR policy for
compute node allocation; and walltime of the pilots compared
to the total walltime of the tasks submitted by the users.
These parameters are evaluated by a custom pilot deployment
algorithm that performs a walltime overallocation estimated
against user-defined parameters, and chooses the number and
sizing of pilots on the basis of the target DCR capabilities.
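
To give a feeling for how such parameters can interact, the following Python sketch sizes a set of pilots from task walltimes, node size, and a job limit. It is emphatically not the Coaster System algorithm, whose details are not given here: the function `plan_pilots`, the `overallocation` factor, and the sizing formula are all assumptions chosen for illustration, and tasks are assumed to be single-core.

```python
import math


def plan_pilots(task_walltimes, cores_per_node, max_jobs, overallocation=2.0):
    """Choose the number, size, and walltime of pilots (illustrative only).

    task_walltimes: estimated walltime in seconds of each single-core task.
    cores_per_node: cores of each DCR compute node; one pilot holds one node.
    max_jobs:       number of jobs the DCR batch system accepts.
    overallocation: hypothetical user-defined walltime multiplier.
    """
    if not task_walltimes:
        raise ValueError("no tasks to plan for")
    # Over-allocate the pilot walltime relative to the longest task.
    pilot_walltime = math.ceil(max(task_walltimes) * overallocation)
    # Core-seconds that one pilot can deliver within its walltime.
    capacity = cores_per_node * pilot_walltime
    needed = math.ceil(sum(task_walltimes) / capacity)
    # Never exceed the number of jobs accepted by the batch system.
    pilots = max(1, min(needed, max_jobs))
    return {
        "pilots": pilots,
        "cores_per_pilot": cores_per_node,
        "walltime": pilot_walltime,
    }
```

Under these assumptions, 64 tasks of 100 seconds on 16-core nodes yield two pilots with a walltime of 200 seconds each; lowering `max_jobs` to 1 caps the plan at a single pilot, at the price of a longer time to completion.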

The Coaster System serves as a Pilot backend for the Swift
System and, together, they can execute workflows composed
of loosely coupled tasks with data dependences. Natively,
the Coaster Client implements a Java CoG Job Submission
Provider [131, 133, 134] for which a Java API is available to
submit tasks and to develop distributed applications. While
tasks are assumed to be single-core by default, multi-core
tasks can be executed by configuring the Coaster System to
submit Workers holding multiple cores [135]. It should also
be possible to execute MPI tasks by having Workers span
multiple compute nodes of a DCR.

The Coaster Service uses providers from the Java CoG Kit
Abstraction Library to submit Workers to DCRs with grid,
HPC, and cloud middleware. The late binding of tasks to
pilots is implemented by Workers pulling tasks to be executed
as soon as free resources are available. It should be noted
that tasks are bound to the pilots instantiated on a specific
DCR specified as part of the task description. Experiments
have been made with late binding to pilots instantiated on
arbitrary DCRs but no documentation is currently available
about the results obtained.⁴


4.3.2 DIANE 

DIANE (Distributed ANalysis Environment) [13] has been
developed at CERN [136] to support the execution of workloads
on the DCRs federated to be part of the European Grid
Infrastructure (EGI) [137] and the Worldwide LHC Computing
Grid (WLCG). DIANE has also been used in the Life Sciences
[138-140] and in a few other scientific domains [141, 142].

DIANE is an application task coordination framework that
executes distributed applications using the master-worker
pattern [13]. DIANE consists of four logical components: a
TaskScheduler, an ApplicationManager, a SubmitterScript,
and a set of ApplicationWorkers [143]. The first two components
- the TaskScheduler and the ApplicationManager - are
implemented as a RunMaster service, while the ApplicationWorkers
are implemented as a WorkerAgent service. SubmitterScripts
deploy ApplicationWorkers on DCRs.

³ Pilot systems are ordered alphabetically in the table and in
the text.

⁴ Based on private communication with the Coaster System
development team.

Pilot System     Pilot Manager          Workload Manager       Task Manager        Pilot

Coaster System   Coaster Service        Coaster Client         Worker              Job Agent
DIANE            Submitter script       RunMaster              Application Worker  WorkerAgent
DIRAC            WMS (Directors)        WMS (Match Maker)      Job Wrapper         Job Agent
GlideinWMS       Glidein Factory        Schedd                 Startd              Glidein
MyCluster        Cluster Builder Agent  Virtual Login Session  Task Manager        Job Proxy
PanDA            Grid Scheduler         PANDA Server           RunJob              Pilot
RADICAL-Pilot    Pilot Manager          CU Manager             Agent               Pilot

Table 3: Mapping between the names given to the components of the pilot architectural pattern defined in §3.2
(Figure 3) and the names given to the components of Pilot system implementations.

Pilot System     Pilot Resources  Pilot Deployment  Workload Semantics                 Workload Binding  Workload Execution

Coaster System   Compute          Implicit          WF (Swift [126])                   Late              Serial, MPI
DIANE            Compute          Explicit          WF (MOTEUR [126])                  Late              Serial
DIRAC            Compute          Implicit          WF (TMS)                           Late              Serial, MPI
GlideinWMS       Compute          Implicit          WF (Pegasus, DAGMan [127])         Late              Serial, MPI
MyCluster        Compute          Implicit          job descriptions                   Late              Serial, MPI
PanDA            Compute          Implicit          BoT                                Late              Serial, MPI
RADICAL-Pilot    Compute, data    Explicit          ENS (EnsembleMD Toolkit [128])     Early, Late       Serial, MPI

Table 4: Overview of Pilot systems and a summary of the values of their core properties. Based on the tooling
currently available for each Pilot system, the types of workload supported as defined in §3.2 are: BoT = Bag
of Tasks; ENS = Ensembles; WF = workflows.

Figure 5: Diagrammatic representation of DIANE
components, functionalities, and core terminology
mapped on Figure 3.

Figure 5 shows how DIANE implements the components
and functionalities of a Pilot system as described in §3: the
RunMaster service is a Workload Manager, the SubmitterScript
is a Pilot Manager, and the ApplicationWorker of each
WorkerAgent service is a Task Manager. Accordingly, the
Pilot Provisioning functionality is implemented by the
SubmitterScript, Task Dispatching by the RunMaster, and Task
Execution by the WorkerAgent. In DIANE, pilots are called
“WorkerAgents”.


The execution model of DIANE can be summarized in
four steps [144]: 1. the user submits one or more jobs to a
DCR by means of SubmitterScript(s) to bootstrap one or more
WorkerAgents; 2. when ready, the WorkerAgent(s) report
back to the ApplicationManager; 3. tasks are scheduled by
the TaskScheduler on the available WorkerAgent(s); 4. after
execution, WorkerAgents send the output of the computation
back to the ApplicationManager.
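
The four steps follow the master-worker pattern and can be sketched in Python as below. This is not the DIANE API: `RunMaster` and `submitter_script` borrow DIANE's component names for readability only, tasks are plain callables, workers are integer identifiers, and the round-robin assignment is an assumption.

```python
class RunMaster:
    """Stand-in for the RunMaster (TaskScheduler + ApplicationManager)."""

    def __init__(self, tasks):
        self.pending = dict(tasks)  # task id -> callable
        self.ready_workers = []
        self.outputs = {}

    def register(self, worker_id):
        # Step 2: a WorkerAgent reports back once it is ready.
        self.ready_workers.append(worker_id)

    def run(self):
        # The master assumes pilots exist; it does not deploy them itself.
        if self.pending and not self.ready_workers:
            raise RuntimeError("no WorkerAgent available")
        # Step 3: tasks are scheduled on the available WorkerAgents.
        for i, (task_id, fn) in enumerate(self.pending.items()):
            worker = self.ready_workers[i % len(self.ready_workers)]
            # Step 4: the output is sent back to the master.
            self.outputs[task_id] = (worker, fn())
        self.pending = {}
        return self.outputs


def submitter_script(master, n_workers):
    # Step 1: the user explicitly bootstraps WorkerAgents on a DCR.
    for worker_id in range(n_workers):
        master.register(worker_id)
```

A typical use would create a `RunMaster` with a dictionary of tasks, call `submitter_script(master, 2)`, and then `master.run()`. Calling `run()` before any worker has registered fails, which reflects the fact that pilot deployment is explicit and left to the user.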

The pilots used by DIANE (i.e., WorkerAgents) hold compute
resources on the target DCRs. WorkerAgents are executed
by the DCR middleware as jobs with mostly one core
but possibly more. DIANE also offers a data service with a
dedicated API and CLI that allows for staging files in and 
out of WorkerAgents. This service represents an abstraction 
of the data resources and capabilities offered by the DCR, 
and it is designed to handle data only in the form of files 
stored into a file system. Network resources are assumed to 
be available among DIANE components. 

DIANE requires a user to develop pilot deployment mechanisms
tailored to specific resources. The RunMaster service
assumes the availability of pilots to schedule the tasks of the
workload. Deployment mechanisms can range from direct
manual execution of jobs on remote resources, to deployment
scripts, or full-fledged factory systems to support the
sustained provisioning of pilots over extended periods of time.

A tool called “GANGA” [145, 146] is available to support 
the development of SubmitterScripts. GANGA facilitates the 
submission of pilots to diverse DCRs by means of a uniform 
interface and abstraction. GANGA offers interfaces for job 
submission to DCRs with Globus, HTCondor, UNICORE, 
or gLite middleware. 

DIANE has been designed to execute workloads that can 
be partitioned into ensembles of parametric tasks on multiple 
pilots. Each task can consist of an executable invocation 
but also of a set of instructions, OpenMP threads, or MPI 
processes [144]. Relations among tasks and groups of tasks 
can be specified before or during runtime, enabling DIANE 
to execute articulated workflows. Plugins have been written 
to manage DAGs [147] and data-oriented workflows [148]. 


DIANE is primarily designed for HTC and Grid environments 
and to execute pilots with a single core. Nonetheless, 
the notion of “capacity” is exposed to the user to allow for 
the specification of pilots with multiple cores. Although the 
workload binding is controllable by the user-programmable 
TaskScheduler, the general architecture is consistent with 
a pull model. The pull model naturally implements the 
late binding paradigm where every ApplicationAgent of each 
available pilot pulls a new task. 
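
The pull model can be illustrated with a minimal Python sketch. This is not DIANE code: the function `run_pull_model`, the inner `worker` function, and the use of threads as stand-ins for pilots are assumptions made only for illustration. Tasks wait in a shared queue and each worker takes the next task only when it is free, so the binding of a task to a pilot is decided at execution time.

```python
import queue
import threading


def run_pull_model(tasks, n_workers):
    """Run callables with late binding: idle workers pull the next task.

    Hypothetical illustration; threads stand in for pilots.
    Returns {task_id: (worker_id, output)}.
    """
    pending = queue.Queue()
    for task_id, task in enumerate(tasks):
        pending.put((task_id, task))

    results = {}
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                # The worker, not the scheduler, initiates the binding.
                task_id, task = pending.get_nowait()
            except queue.Empty:
                return  # no work left: the pilot can terminate
            output = task()
            with lock:
                results[task_id] = (worker_id, output)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Which worker executes which task is not known in advance and may change between runs; only the set of completed tasks is deterministic.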


4.3.3 DIRAC 

DIRAC (Distributed Infrastructure with Remote Agent 
Control) [149] is a software product developed by the CERN 
LHCb project. DIRAC implements a Workload Management 
System (WMS) to manage the processing of detector data, 
Monte Carlo simulations, and end-user analyses. DIRAC 
primarily serves as the LHCb workload management interface 
to WLCG executing workloads on DCRs deploying Grid, 
Cloud, and HPC middleware. 

DIRAC has four main logical components: a set of 
TaskQueues, a set of TaskQueueDirectors, a set of JobWrappers, 
and a MatchMaker. TaskQueues, TaskQueueDirectors, 
and the MatchMaker are implemented within a monolithic 
WMS. Each TaskQueue collects tasks submitted by users, 
with multiple TaskQueues being created depending on the 
requirements and ownership of the tasks. JobWrappers are executed 
on the DCR to bind compute resources and execute tasks 
submitted by the users. Each TaskQueueDirector submits 
JobWrappers to target DCRs. The MatchMaker matches requests 
from JobWrappers to suitable tasks in TaskQueues. 

DIRAC was the first pilot-based WMS designed to serve a 
main LHC experiment [14]. Figure [6] shows how the DIRAC 
WMS implements a Workload, a Pilot, and a Task Manager 
as they have been described in §3: TaskQueues and 
the MatchMaker implement the Workload Manager and the 
related Task Dispatching functionality. Each TaskQueueDirector 
implements a Pilot Manager and its Pilot Provisioning 
functionality, while each JobWrapper implements a Task 
Manager and Task Execution. 

The DIRAC execution model can be summarized in five 
steps: 1. a user submits one or more tasks by means of 
a CLI, Web portal, or API to the WMS Job Manager; 2. 
submitted tasks are validated and added to a new or an 
existing TaskQueue, depending on the task properties; 3. one 
or more TaskQueues are evaluated by a TaskQueueDirector 
and a suitable number of JobWrappers are submitted to 
available DCRs; 4. JobWrappers, once instantiated on the 
DCRs, pull the MatchMaker asking for tasks to be executed; 5. 
tasks are executed by the JobWrappers under the supervision 
of each JobWrapper’s Watchdog. 

JobWrappers, the DIRAC pilots, hold compute resources 
in the form of single or multiple cores, spanning portions, 
whole, or multiple compute nodes. A dedicated subsystem 
is offered to manage data staging and replication but data 
capabilities are not exposed via pilots. Network resources 
are assumed to be available to allow pilots to communicate 
with the WMS. 

Figure 6: Diagrammatic representation of DIRAC 
components, functionalities, and core terminology 
mapped on Figure [3] 

Pilots are deployed by TaskQueueDirectors. Three main 
operations are iterated: 1. getting a list of TaskQueues; 
2. calculating the number of pilots to submit depending 
on the user-specified priority of each task, and the number 
and properties of the available or scheduled pilots; and 3. 
submitting the calculated number of pilots. 
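
The second operation can be sketched in Python under explicit assumptions. The policy below is hypothetical and is not the algorithm used by DIRAC: it assumes that each TaskQueue carries a single priority value (higher is served first), that one pilot serves one waiting task, and that each iteration has a fixed submission budget. The names `TaskQueue`, `pilots_to_submit`, and `budget` are introduced here only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskQueue:
    name: str
    priority: float       # assumed convention: higher is served first
    waiting_tasks: int    # tasks not yet matched to a pilot
    matching_pilots: int  # pilots already available or scheduled


def pilots_to_submit(task_queues, budget):
    """Return {queue name: number of pilots to submit} for one iteration.

    Hypothetical policy: cover the deficit of each queue in priority
    order until the per-iteration budget is exhausted.
    """
    plan = {}
    remaining = budget
    for tq in sorted(task_queues, key=lambda q: q.priority, reverse=True):
        deficit = max(0, tq.waiting_tasks - tq.matching_pilots)
        granted = min(deficit, remaining)
        if granted > 0:
            plan[tq.name] = granted
        remaining -= granted
    return plan
```

A queue whose waiting tasks are already covered by available or scheduled pilots receives no new pilots, which is one simple way of accounting for the pilots mentioned in the second operation.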

Natively, DIRAC can execute tasks described by means of 
the Job Description Language (JDL) [150]. As such, single-core, 
multi-core, MPI, parametric, and collection tasks can be 
described and submitted. Users can specify a priority index 
for each submitted task and one or more specific DCRs that 
should be targeted for execution. Tasks with complex data 
dependences can be described by means of a DIRAC system 
called “Transformation Management System” (TMS) [151]. 
In this way, user-specified, data-driven workflows can be 
automatically submitted and managed by the DIRAC WMS. 

Similar to DIANE and the Coaster System, DIRAC features 
a task pull model that naturally implements late binding 
of tasks to pilots. Each JobWrapper pulls a new task once it 
is available and has free resources. No early binding of tasks 
on pilots is offered. 

4.3.4 HTCondor Glidein and GlideinWMS 

The HTCondor Glidein system [152] was developed by 
the Center for High Throughput Computing at the University 
of Wisconsin-Madison (UW-Madison) [153] as part of 
the HTCondor [154] software ecosystem. The HTCondor 
Glidein system implements pilots to aggregate DCRs with 
heterogeneous middleware into HTCondor resource pools. 

The logical components of HTCondor relevant to the 
Glidein system are: a set of Schedd and Startd daemons, a 
Collector, and a Negotiator [10]. Schedd is a queuing system 
that holds workload tasks and Startd handles the DCR 
resources. The Collector holds references to all the active 
Schedd/Startd daemons, and the Negotiator matches tasks 
queued in a Schedd to resources handled by a Startd. 

Figure 7: Diagrammatic representation of Glidein 
components, functionalities, and core terminology 
mapped on Figure [3] 

HTCondor Glidein has been complemented by GlideinWMS, 
a Glidein-based workload management system that 
automates deployment and management of Glideins on multiple 
types of DCR middleware. GlideinWMS builds upon the 
HTCondor Glidein system by adding the following logical 
components: a set of Glidein Factory daemons, a set of Frontend 
daemons for Virtual Organizations (VOs) [155, 156], and 
a Collector dedicated to the WMS [157]. Glidein Factories 
submit tasks to the DCR middleware, each VO Frontend 
matches the tasks on one or more Schedd to the resource 
attributes advertised by a specific Glidein Factory, and the 
WMS Collector holds references to all the active Glidein 
Factories and VO Frontend daemons. 

Figure [7] shows the mapping of the HTCondor Glidein 
Service and GlideinWMS elements to the components and 
functionalities of a Pilot system as described in §3. The set 
of VO Frontends and Glidein Factories alongside the WMS 
Collector implement a Pilot Manager and its pilot provisioning 
functionality. The set of Schedd, the Collector, and the 
Negotiator implement a Workload Manager and its task 
dispatching functionality. The Startd daemon implements a 
Task Manager alongside its task execution functionality. A 
Glidein is a job submitted to a DCR middleware that, once 
instantiated, configures and executes a Startd daemon. A Glidein 
is therefore a pilot. 

The execution model of the HTCondor Glidein system can 
be summarized in nine steps: 1. the user submits a Glidein 
(i.e., a job) to a DCR batch scheduler; 2. once executed, this 
Glidein bootstraps a Startd daemon; 3. the Startd daemon 
advertises itself to the Collector; 4. the user submits the 
tasks of the workload to the Schedd daemon; 5. the Schedd 
advertises these tasks to the Collector; 6. the Negotiator 
matches the requirements of the tasks to the properties of 
one of the available Startd daemons (i.e., a Glidein); 7. the 
Negotiator communicates the match to the Schedd; 8. the 
Schedd submits the tasks to the Startd daemon indicated by 
the Negotiator; 9. the task is executed. 
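
Steps 6 and 7 can be sketched as a simple matchmaking function. The sketch below does not use the ClassAd language or any HTCondor API: the functions `matches` and `negotiate`, and the representation of requirements and properties as dictionaries of numeric quantities, are assumptions made for illustration only.

```python
def matches(requirements, properties):
    """A slot matches when it offers at least every required quantity."""
    return all(properties.get(key, 0) >= value
               for key, value in requirements.items())


def negotiate(tasks, slots):
    """Match each queued task to the first free slot that satisfies it.

    tasks: list of (task_id, requirements dict), in queue order
    slots: dict of slot_id -> properties dict (one slot per Glidein here)
    Returns (dict task_id -> slot_id, list of unmatched task ids).
    """
    free = dict(slots)
    matched, unmatched = {}, []
    for task_id, requirements in tasks:
        slot_id = next((s for s, props in free.items()
                        if matches(requirements, props)), None)
        if slot_id is None:
            unmatched.append(task_id)  # stays queued until a slot appears
        else:
            matched[task_id] = slot_id
            del free[slot_id]          # a slot runs one task at a time
    return matched, unmatched
```

Because the match is computed only against slots that are already active, the task is bound to a resource at the latest possible moment.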

GlideinWMS extends the execution model of the HTCondor 
Glidein system by automating the provisioning of Glideins. 
The user does not have to submit Glideins directly but only 
tasks to Schedd. From there: 1. every Schedd advertises its 
tasks with the VO Frontend; 2. the VO Frontend matches 
the tasks’ requirements to the resource properties advertised 
by the WMS Collector; 3. the VO Frontend places requests 
for Glidein instantiation to the WMS Collector; 4. the 
WMS Collector contacts the appropriate Glidein Factory to 
execute the requested Glideins; 5. the requested Glideins become 
active on the DCRs; and 6. the Glideins advertise their 
availability to the (HTCondor) Collector. From there on the 
execution model is the same as described for the HTCondor 
Glidein Service. 

The resources managed by a single Glidein (i.e., pilot) 
are limited to compute resources. Glideins may bind one or 
more cores, depending on the target DCRs. For example, 
heterogeneous HTCondor pools with resources for desktops, 
workstations, small campus clusters, and some larger clusters 
will run mostly single core Glideins. More specialized pools 
that hold, for example, only DCRs with HTC, Grid, or Cloud 
middleware may instantiate Glideins with a larger number 
of cores. Both HTCondor Glidein and GlideinWMS provide 
abstractions for file staging but pilots are not used to hold 
data or network resources. 

The process of pilot deployment is the main difference 
between HTCondor Glidein and GlideinWMS. While the 
HTCondor Glidein system requires users to submit the pilots 
to the DCRs, GlideinWMS automates and optimizes 
pilot provisioning. GlideinWMS attempts to maximize the 
throughput of task execution by continuously instantiating 
Glideins until the queues of the available Schedd are emptied. 
Once all the tasks have been executed, the remaining 
Glideins are terminated. 

HTCondor Glidein and GlideinWMS expose the interfaces 
of HTCondor to the application layer and no theoretical limitation 
is posed on the type and complexity of the workloads 
that can be executed. For example, DAGMan (Directed 
Acyclic Graph Manager) [158] has been designed to execute 
workflows by submitting tasks to Schedd, and a tool is 
available to design applications based on the master-worker 
coordination pattern. 

HTCondor was originally designed for resource scavenging 
and opportunistic computing. Thus, in practice, independent 
single-core (or few-core) tasks are more commonly executed 
than many-core tasks, as is the case for OSG, the largest 
HTCondor and GlideinWMS deployment. Nonetheless, in 
principle, projects may use dedicated installations and resources 
to execute tasks with larger core requirements both 
for distributed and parallel applications, including MPI 
applications. 

Both HTCondor Glidein and GlideinWMS rely on one or 
more HTCondor Collectors to match task requirements and 
resource properties, represented as ClassAds [159]. This 
matching can be evaluated right before the scheduling of the 
task. In this way, late binding is achieved but early binding 
remains unsupported. 

4.3.5 MyCluster 

MyCluster [160, 161] is not maintained but is included in 
the comparison because it presents some distinctive features. 
Its user/Pilot system interface and task submission system 
based on the notion of virtual cluster highlight the flexibility 
of Pilot system implementations. Moreover, MyCluster 
was one of the first Pilot systems to be aimed specifically at 
HPC DCRs. 

Figure 8: Diagrammatic representation of MyCluster 
components, functionalities, and core terminology 
mapped on Figure [3] 

MyCluster was originally developed at the Texas Advanced 
Computing Center (TACC) [162], sponsored by NSF to enable 
execution of workloads on TeraGrid [163], a set of DCRs 
deploying Grid middleware. MyCluster provides users with 
virtual clusters: aggregates of homogeneous resources dynamically 
acquired on multiple and diverse DCRs. Each virtual 
cluster exposes HTCondor, SGE [164], or OpenPBS [165] 
job-submission systems, depending on the user and use case 
requirements. 

MyCluster is designed around three main components: a 
Cluster Builder Agent, a system where users create Virtual 
Login Sessions, and a set of Task Managers. The Cluster 
Builder Agent acquires the resources from diverse DCRs by 
means of multiple Task Managers, while the Virtual Login 
Session presents these resources as a virtual cluster to the 
user. A virtual login session can be dedicated to a single 
user, or customized and shared by all the users of a project. 
Upon login on the virtual cluster, a user is presented with a 
shell-like environment used to submit tasks for execution. 

Figure [8] shows how the components of MyCluster map 
to the components and functionalities of a Pilot system as 
described in §3.1. The Cluster Builder Agent implements 
a Pilot Manager and a Virtual Login Session implements 
a Workload Manager. The Task Manager shares its name 
and functionality with the homonymous component defined 
in §3.1. The Cluster Builder Agent provides Task Managers 
by submitting Job Proxies to diverse DCRs, and a Virtual 
Login Session uses the Task Managers to submit and execute 
tasks. As such, Job Proxies are pilots. 

The execution model of MyCluster can be summarized 
in five steps: 1. a user logs into a dedicated virtual cluster 
via, for example, ssh to access a dedicated Virtual Login 
Session; 2. the user writes a job wrapper script using the 
HTCondor, SGE, or OpenPBS job specification language; 3. 
the user submits the job to the job submission system on 
the virtual cluster; 4. the Cluster Builder Agent submits 
a suitable number of Job Proxies on one or more DCRs; 5. 
when the Job Proxies become active, the user-submitted job 
is executed on the resources they hold. 

Job Proxies hold compute resources in the form of compute 
cores. MyCluster does not offer any dedicated data subsystem 
and Job Proxies (i.e., pilots) are not used to expose data 
resources to the user. Users are assumed to stage the data 
required by the compute tasks directly, or by means of the 
data capabilities exposed by the job submission system of 
the virtual cluster. Networking is assumed to be available 
among the MyCluster components. 

The Cluster Builder Agent submits Job Proxies to each 
DCR by using the GridShell framework [166]. GridShell 
wraps the Job Proxies description into the job description 
language supported by the target DCR. Thanks to GridShell, 
MyCluster can submit jobs to DCRs with diverse middleware. 

MyCluster exposes a virtual cluster with a predefined job 
submission system to the user. Pilots can have a user-defined 
number of cores within or across compute nodes. As such, every 
application built to utilize HTCondor, SGE, or OpenPBS can 
be executed transparently on MyCluster. This includes single 
and multi-core tasks, MPI tasks, and data-driven workflows. 

The jobs specified by a user are bound to the DCR resources 
as soon as Job Proxies become active. The user does not have 
to specify on which Job Proxies or DCR each task has to be 
executed. In this way, MyCluster implements late binding. 

4.3.6 PanDA 

PanDA (Production and Distributed Analysis) [167] was 
developed to provide a workload management system (WMS) 
for ATLAS. ATLAS is a particle detector at the LHC that 
requires a WMS to handle large numbers of tasks for their 
data-driven processing workloads. In addition to the logistics 
of handling large-scale task execution, ATLAS also needs 
integrated monitoring for the analysis of system state, and a 
high degree of automation to reduce user and administrative 
intervention. 

PanDA was initially deployed as an HTC-oriented, 
multi-user WMS consisting of 100 heterogeneous 
computing sites [168]. Recent improvements to PanDA have 
extended the range of deployment scenarios to HPC and 
cloud-based DCRs, making PanDA a general-purpose Pilot 
system [169]. 

The PanDA architecture consists of a Grid Scheduler and a 
PanDA Server [170, 171]. The Grid Scheduler is implemented 
by a component called “AutoPilot” that submits jobs to 
diverse DCRs. The PanDA server is implemented by four 
main components: a Task Buffer, a Broker, a Job Dispatcher, 
and a Data Service. The Task Buffer collects all the submitted 
tasks into a global queue and the Broker prioritizes and binds 
those tasks to DCRs on the basis of multiple criteria. The 
Data Service stages the input file(s) of the tasks to the DCR 
to which the tasks have been bound using the data transfer 
technologies exposed by the DCR middleware (e.g., uberftp, 
gridftp, or lcg-cp). The Job Dispatcher delivers the tasks to 
the RunJobs run by each Pilot bound to a DCR. 

Figure [9] shows how PanDA implements the components 
and functionalities of a Pilot system as described in §3: the 
Grid Scheduler is a Pilot Manager implementing Pilot 
Provisioning while the PanDA Server is a Workload Manager 
implementing Task Dispatching. The jobs submitted by the 
Grid Scheduler are called “Pilots” and act as pilots once 
instantiated on the DCR by running RunJob, i.e., the Task 
Manager. RunJob contacts the Job Dispatcher component 
to request tasks to be executed. 

Figure 9: Diagrammatic representation of PanDA 
components, functionalities, and core terminology 
mapped on Figure [3] 

Figure 10: Diagrammatic representation of RADICAL-Pilot 
components, functionalities, and core terminology 
mapped on Figure [3] 

The execution model of PanDA can be summarized in 
eight steps [172, 173]: 1. the user submits tasks to the PanDA 
server; 2. the tasks are queued within the Task Buffer; 3. the 
task requirements are evaluated by the Broker and bound 
to a DCR; 4. the input files of the tasks are staged to the 
bound DCR by the Data Service; 5. the required pilot(s) are 
submitted as jobs to the target DCR; 6. the submitted pilot(s) 
become available and report back to the Job Dispatcher; 7. 
tasks are dispatched to the available pilots for execution; 8. 
tasks are executed. 

PanDA pilots expose mainly single cores, but extensions 
have been developed to instantiate pilots with multiple 
cores [174]. The Data Service of PanDA allows the integration 
and automation of data staging within the task execution 
process, but no pilots are offered for data [168]. Network 
resources are assumed to be available among PanDA components, 
but no network-specific abstraction is made available. 

The AutoPilot component of PanDA’s Grid Scheduler has 
been designed to use multiple methods to submit pilots to 
DCRs. The PanDA installations of the US ATLAS infrastructure 
use the HTCondor-G [72] system to submit pilots 
to the US production sites. Other schedulers enable AutoPilot 
to submit to local and remote batch systems and to the 
GlideinWMS frontend. Submissions via the canonical tools 
offered by HTCondor have also been used to submit tasks to 
cloud resources. 

PanDA was initially designed to serve specifically the ATLAS 
use case, executing mostly single-core tasks with input 
and output files. Since its initial design, the ATLAS 
analysis and simulation tools have started to investigate 
multi-core task execution with AthenaMP [174] and PanDA 
has been evolving towards a more general-purpose workload 
manager [175-177]. Currently, PanDA offers experimental 
support for multi-core pilots and tasks with or without data 
dependences. PanDA is being generalized to support applications 
from a variety of science domains [178]. 

PanDA offers late binding but not early binding capabilities. 
Workload jobs are assigned to activated and validated pilots 
via the PanDA server based on brokerage criteria like data 
locality and resource characteristics. 

4.3.7 RADICAL-Pilot 

The authors of this paper have been engaged in theoretical 
and practical aspects of Pilot systems. In addition to 
formulating the P* Model [179], the RADICAL group [180] 
is responsible for developing and maintaining the RADICAL-Pilot 
Pilot system [181]. RADICAL-Pilot is built upon the 
experience gained from developing BigJob, and integrating 
it with many applications [182-184] on different DCRs. 

RADICAL-Pilot consists of five main logical components: 
a Pilot Manager, a Compute Unit (CU) Manager, a set of 
Agents, the SAGA-Python DCR interface, and a database. 
The Pilot Manager describes pilots and submits them via 
SAGA-Python to DCR(s), while the CU Manager describes 
tasks (i.e., CUs) and schedules them to one or more pilots. 
Agents are instantiated on DCRs and execute the CUs pushed 
by the CU Manager. The database is used for the communication 
and coordination of the other four components. 

RADICAL-Pilot closely resembles the description offered 
in §3 (see Figure [10]). The Pilot Manager and SAGA-Python 
implement the logical component also called “Pilot Manager” 
in §3.1. The Workload Manager is implemented by the CU 
Manager. The Agent is deployed on the DCR to expose its 
resources and execute the tasks pushed by the CU Manager. 
As such, the Agent is a pilot. 

RADICAL-Pilot is implemented as two Python modules 
to support the development of distributed applications. The 
execution model of RADICAL-Pilot can be summarized in 
six steps: 1. the user describes tasks in Python as a set of 
CUs with or without data and DCR dependences; 2. the 
user also describes one or more pilots, choosing the DCR(s) 
they should be submitted to; 3. upon execution of the user’s 
application, the Pilot Manager submits each pilot that has 
been described to the indicated DCR utilizing the SAGA 
interface; 4. the CU Manager schedules each CU either to 
the pilot indicated in the CU or on the first pilot with free 
and available resources; scheduling is done by storing the 
CU description into the database; 5. when required, the 
CU Manager also stages the CU’s input file(s) to the target 
DCR; and 6. the Agent pulls its CU from the database and 
executes it. 

The Agent component of RADICAL-Pilot offers abstractions 
for both compute and data resources. Every Agent can 
expose between one and all the cores of the compute node 
where it is executed; it can also expose a data handle that 
abstracts away specific storage properties and capabilities. 
In this way, the CUs running on an Agent can benefit from 
unified interfaces to both core and data resources. Networking 
is assumed to be available between the RADICAL-Pilot 
components. 

The Pilot Manager deploys the Agents of RADICAL-Pilot 
by means of the SAGA-Python API [98]. SAGA provides access 
to diverse DCR middleware via a unified and coherent 
API, and thus RADICAL-Pilot can submit pilots to resources 
exposed by XSEDE and NERSC [185], by the OSG HTCondor 
pools, and many “leadership” class systems like those 
managed by OLCF [186] or NCSA [187]. 


The resulting separation of agent deployment from DCR 
architecture reduces the overheads of adding support for a 
new DCR [23]. This is illustrated by the relative ease with 
which RADICAL-Pilot is extended to support (i) a new type 
of DCR such as IaaS, and (ii) DCRs that have essentially 
similar architecture but different middleware, for example 
the Cray supercomputers operated in the US and Europe. 

RADICAL-Pilot can execute tasks with varying coupling 
and communication requirements. Tasks can be completely 
independent, single or multi-threaded; they may be loosely 
coupled, requiring input and output file dependencies, or 
they might require low-latency runtime communication. As 
such, RADICAL-Pilot supports MPI applications, workflows, 
and diverse execution patterns such as pipelines. 

CU descriptions may or may not contain a reference to 
the pilot to which the user wants to bind the CU. When a 
reference is present, the scheduler of the CU Manager waits 
for a slot to be available on the indicated pilot. When a target 
pilot is not specified, the CU Manager binds and schedules 
the CU on the first pilot available. As such, RADICAL-Pilot 
supports both early and late binding, depending on the use 
case and the user specifications. 
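
The two binding modes can be contrasted in a short Python sketch. It deliberately does not use the RADICAL-Pilot API: the function `bind_units`, the dictionary-based unit descriptions, and the notion of a per-pilot count of free slots are assumptions introduced only to illustrate the scheduling rule described above.

```python
def bind_units(units, free_slots):
    """Bind compute units to pilots with early or late binding.

    units: list of dicts with "name" and an optional "pilot" key
           holding the id of the pilot requested by the user
    free_slots: dict of pilot_id -> number of free slots
    Returns (dict unit name -> pilot_id, list of unit names left waiting).
    """
    slots = dict(free_slots)
    bindings, waiting = {}, []
    for unit in units:
        target = unit.get("pilot")
        if target is not None:
            # Early binding: only the pilot named by the user is acceptable.
            candidates = [target] if slots.get(target, 0) > 0 else []
        else:
            # Late binding: any pilot with a free slot will do.
            candidates = [p for p, n in slots.items() if n > 0]
        if candidates:
            chosen = candidates[0]
            slots[chosen] -= 1
            bindings[unit["name"]] = chosen
        else:
            waiting.append(unit["name"])  # retried when a slot frees up
    return bindings, waiting
```

In this sketch an early-bound unit waits even when other pilots are idle, which is the trade-off a user accepts when naming a target pilot.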


4.4 Comparison 

The previous subsection shows how diverse Pilot system 
implementations conform to the architecture pattern we 
described in §3.2. This confirms the generality of the pattern 
at capturing the components and functionalities required to 
implement a Pilot system. The described Pilot systems also 
show implementation differences, especially concerning the 
following auxiliary properties: Architecture, Communication 
and Coordination, Interoperability, Interface, Security, and 
Performance and Scalability. 

The Pilot systems described in §4.3 implement different 
architectures. DIANE, DIRAC, and, to some extent, both 
PanDA and the Coaster System are monolithic (Figures [5], 
[6], [9], and [4]). Most of their functionalities are aggregated into 
a single component implemented “as a service” [188]. A dedicated 
hardware infrastructure is assumed for a production-grade 
deployment of DIRAC and PanDA. Consistent with 
a Globus-oriented design, the Coaster Service is instead assumed 
to be run on the DCR acting as a proxy for both the 
pilot and workload functionalities. 

MyCluster and RADICAL-Pilot are also mostly monolithic 
(Figures [10] and [8]) but not implemented as a service. 
MyCluster resembles the architecture of an HPC middleware 
while RADICAL-Pilot is implemented as two Python modules. 
MyCluster requires dedicated hardware analogously to the 
head-node of a traditional HPC cluster. RADICAL-Pilot 
users are instead free to decide where to deploy their applications, 
either locally on workstations or remotely on dedicated 
machines. In production-grade deployment, RADICAL-Pilot 
requires a dedicated database to support its communication 
and coordination protocols. 

GlideinWMS requires integration within the HTCondor 
ecosystem and therefore also a service-oriented architecture 
but it departs from a monolithic design. GlideinWMS implements 
a set of separate, mostly autonomous services 
(Figure [7]) that can be deployed depending on the available 
resources and on the motivating use case. 

Architecture frameworks and description languages 
[190] can be used to further specify and refine the component 
architectures in Figures [4]-[10]. For example, the 4+1 
framework alongside a UML-based notation [191, 192] could 
be used to describe multiple “views” of each Pilot system 
architecture, offering more details and better documentation 
about the implementation of their components, the functionalities 
provided to the user, the behavior of the system, and 
its deployment scenarios. 

The Pilot systems described in the previous subsection 
also display differences in their communication and coordination 
models. While all the Pilot systems assume preexisting 
networking functionalities, the Coaster System implements a 
dedicated communication protocol used both for coordination 
and data staging. The Coaster System and RADICAL-Pilot 
can both work as communication proxies among the Pilot 
system’s components when the DCR compute nodes do not 
expose a public network interface. All the Pilot systems implement 
the master-worker coordination pattern, but the Task 
and the Workload Managers in DIRAC, PanDA, MyCluster, 
and the Coaster System can also coordinate to recover 
task failures and isolate under-performing or failing DCR 
compute nodes. 

Figures [4]-[10] also show different interfaces between Pilot 
systems and DCRs, and between Pilot systems and users 
or applications. Most of the described Pilot systems interoperate 
across diverse DCR middleware, including HPC, grid, 
and cloud batch systems. Implementations of this interoperability 
diverge, ranging from the dedicated SAGA API used 
by RADICAL-Pilot, to special-purpose connectors used by 
DIANE, DIRAC, and PanDA, to the installation of specialized 
components on the DCR middleware used by the Coaster 
System, Glidein, and MyCluster. These interfaces are functionally 
analogous; reducing their heterogeneity would limit 
effort duplication and promote interoperability across Pilot 
systems. 

The interfaces exposed to give users access to pilot capabilities 
differ both in types and implementations. DIANE, 
DIRAC, GlideinWMS, MyCluster, and PanDA offer command 
line tools. These are often tailored to specific use cases, 
applications, and DCRs, and need to be installed on the 
users’ workstations or on dedicated machines. The Coaster 
System and RADICAL-Pilot expose an API, and the command 
line tools of DIANE, DIRAC, and PanDA are built 
on APIs that users may directly access to develop distributed 
applications. 

Differences in the user interfaces stem from assumptions 
about distributed applications and their use cases. Interfaces 
based on command line tools assume applications that can 
be “submitted” to the Pilot system for execution. APIs 
assume instead applications that need to be coded by the 
user, depending on the specific requirements of the use case. 
These assumptions justify multiple aspects of the design 
of Pilot systems, determining many characteristics of their 
implementations. 

The described Pilot systems also implement different types 
of authentication and authorization (AA). The AA required 
by the user to submit tasks to their own pilots varies depending 
on the pilot’s tenancy. With single tenancy, AA can be 
based on inherited privileges as the pilot can be accessed 
only by the user that submitted it. With multitenancy, the 
Pilot system has to evaluate whether a user requesting access 
to a pilot is part of the group of allowed users. This requires 
abstractions like virtual organizations and certificate 
authorities [193], implemented, for example, by GlideinWMS and 
the Coaster System. 
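
The difference between the two tenancy models can be made concrete with a minimal authorization check. The sketch is hypothetical and does not reflect the implementation of any of the described Pilot systems: the function `may_access`, the `allowed_group` field, and the representation of users and pilots as dictionaries are assumptions, and authentication (e.g., certificate validation) is assumed to have happened beforehand.

```python
def may_access(user, pilot):
    """Decide whether an authenticated user may submit tasks to a pilot.

    user:  dict with "name" and "groups" (set of group names, e.g. VOs)
    pilot: dict with "owner" and, for multitenant pilots only,
           "allowed_group"
    """
    allowed_group = pilot.get("allowed_group")
    if allowed_group is None:
        # Single tenancy: privileges are inherited from the submitter.
        return user["name"] == pilot["owner"]
    # Multitenancy: membership in the allowed group is required.
    return allowed_group in user["groups"]
```

With single tenancy the check reduces to an identity comparison, while multitenancy requires a trusted source of group membership, which is the role played by virtual organizations.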

The credential used for pilot deployment depends on the 
target DCR. The AA requirements of DCRs are a diverse and 
often inconsistent array of mechanisms and policies. Pilot 
systems are gregarious in the face of such a diversity as they 
need to present the credentials provided by the application 
layer (or directly by the user) to the DCR. As such, the 
AA requirements specific to Pilot systems are minimal but 
the implementation required to present suitable credentials 
may be complex, especially when considering Pilot systems 
offering interoperability among diverse DCRs. 

Finally, the differences among Pilot system implementa¬ 
tions underline the difficulties in defining and correlating 
performance metrics. The performance of each Pilot system 
can be evaluated under multiple metrics that are affected 
by the workload, the Pilot system behavior, and the DCR. 
For example, the commonly used metrics of system overhead 
and workload’s time to completion depend on the design of 
the Pilot system; on the data, compute and network require¬ 
ments of the workload executed; and on the capabilities of 
the target resources. These parameters vary at every exe¬ 
cution and require dedicated instrumentation built into the 
Pilot system to be measured. Without consistent performance 
models and a shared set of probes, performance comparison 
among Pilot systems appears unfeasible. 
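
As an illustration of why such instrumentation is needed, the following sketch derives the two metrics from timestamps that a Pilot system would have to record. The definitions used here are assumptions made for the example, not the definitions adopted by any specific Pilot system: time to completion is measured from workload submission to the end of the last task, and overhead is the portion of that time during which no task was executing.

```python
def time_to_completion(submitted, task_intervals):
    """Time from workload submission to the end of the last task."""
    return max(end for _, end in task_intervals) - submitted


def busy_time(task_intervals):
    """Wall-clock time during which at least one task was executing."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(task_intervals):
        if cur_end is None or start > cur_end:
            # A gap: close the current merged interval and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping execution: extend the current merged interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def overhead(submitted, task_intervals):
    """Assumed definition: time to completion not spent executing tasks."""
    return time_to_completion(submitted, task_intervals) - busy_time(task_intervals)


# Hypothetical probe data: (start, end) of each task, in seconds.
intervals = [(10, 20), (15, 30), (40, 50)]
print(time_to_completion(0, intervals), overhead(0, intervals))
```

Two Pilot systems that place their probes differently, for example by timestamping submission at the application layer rather than at the pilot, would report different values for the same run, which is the comparability problem noted above.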


5. DISCUSSION AND CONCLUSION 

We introduced the Pilot abstraction in §3, describing the 
capabilities, components, and architecture pattern of Pilot 
systems. We also defined a terminology consistent across 
Pilot systems, clarifying the meaning of “pilot”, “job”, and 
their cognate concepts. In §4 we offered a classification of the 
core and auxiliary properties of Pilot system implementations, 
and we analyzed a set of exemplars. Considered altogether, 
these contributions outline a paradigm for the execution of 
heterogeneous, multi-task workloads via multi-entity and 
multi-stage scheduling on DCR resource placeholders. This 
computing paradigm is here referred to as “Pilot paradigm”. 


5.1 The Pilot Paradigm 

The generality of the Pilot paradigm may come as a sur¬ 
prise when considering that, traditionally, Pilot systems have 
been implemented to optimize the throughput of single-core 
(or at least single-node), short-lived, uncoupled task 
execution [3, 194, 195]. For example, DIANE, DIRAC, MyCluster, 
PanDA, or HTCondor Glidein and GlideinWMS were initially 
developed to focus on either a type of workload, a specific 
infrastructure, or the optimization of a single performance 
metric. 

The Pilot paradigm is general because the execution of a 
workload via multi-entity and multi-stage scheduling on DCR 
resource placeholders does not have to depend on a single 
type of workload, DCR, or resource. In principle, systems 
implementing the Pilot paradigm can execute workloads 
composed of an arbitrary number of tasks with diverse data, 
compute, and networking requirements. The same generality 
applies to the types of DCR and of resource on which a Pilot 
system executes workloads.⁵ 
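
The independence of the paradigm from any particular workload or resource type can be made concrete with a small sketch of the second scheduling stage, in which tasks are bound to placeholders that are already active. Everything in it is an assumption made for illustration: the `Task` and `Pilot` classes, the single `cores` attribute used as the resource unit, and the first-fit policy. The first stage, in which the pilot itself is scheduled on a DCR, is omitted.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    cores: int


@dataclass
class Pilot:
    """Resource placeholder, assumed to be already active on some DCR."""
    name: str
    cores: int
    free: int = field(init=False)
    running: list = field(default_factory=list)

    def __post_init__(self):
        self.free = self.cores


def schedule(tasks, pilots):
    """Bind each queued task to the first pilot with enough free cores.

    Returns the tasks that could not be placed on any pilot.
    """
    queue = deque(tasks)
    unplaced = []
    while queue:
        task = queue.popleft()
        for pilot in pilots:
            if pilot.free >= task.cores:
                pilot.free -= task.cores
                pilot.running.append(task.name)
                break
        else:
            # No pilot can currently host the task.
            unplaced.append(task)
    return unplaced


pilots = [Pilot("p1", 4), Pilot("p2", 2)]
tasks = [Task("a", 3), Task("b", 2), Task("c", 2), Task("d", 1)]
left = schedule(tasks, pilots)
print([p.running for p in pilots], [t.name for t in left])
```

Nothing in `schedule` depends on what a task computes or on the middleware through which a pilot was acquired. Replacing `cores` with a data or network capacity would leave the logic unchanged.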

The analysis presented in §4 shows how Pilot systems 
have progressed to implement the generality of the Pilot 
paradigm. Pilot systems are now engineered to execute 
homogeneous or heterogeneous workloads; these workloads 
can comprise independent or intercommunicating tasks 
of arbitrary duration and data and computation requirements. 
These workloads can also be executed on an increasingly 
diverse pool of DCRs. Pilot systems were originally designed 
for DCRs with HTC grid middleware; Pilot systems have 
since emerged that can also operate on DCRs with 
HPC and cloud middleware. 

As seen in §3, the Pilot paradigm demands resource 
placeholders but does not specify the type of resource that the 
placeholder should expose. In principle, pilots can also be 
placeholders for data or network resources, either exclusively 
or in conjunction with compute resources. For example, in 
Ref. [97] the concept of Pilot-Data was conceived to be as 
fundamental to dynamic data placement and scheduling as Pilot 
is to computational tasks. The concept of “Pilot networks” 
was introduced in Ref. [196] in reference to Software-Defined 
Networking [197] and User-Schedulable Network paths [198]. 


The generality of the Pilot paradigm also promotes the 
adoption of Pilot functionalities and systems by other mid¬ 
dleware and tools. For example, Pilot systems have been 
successfully integrated within workflow systems to support 
optimal execution of workloads with articulated data and 
single or multi-core task dependencies [103, 132, 199]. As 
such, execution can be optimized not only for the throughput 
of multi-core, long-lived, coupled tasks, but also for 
data/compute placement and dynamic resource sizing. 

The Pilot paradigm is not limited to academic projects 
and scientific experiments. Hadoop [200] introduced the 
YARN [201] resource manager for heterogeneous workloads. 
YARN supports multi-entity and multi-stage scheduling: 
applications initialize an “Application-Master” via YARN; the 
Application-Master allocates resources in “containers” for 
the applications; and YARN can then execute tasks in these 
containers (i.e., resource placeholders). TEZ [202], a DAG 
processing engine primarily designed to support the Hive 
SQL engine [203], enables applications to hold containers 
across the DAG execution without de/reallocating resources. 
Independent of the Hadoop developments, Google’s Kubernetes 
is emerging as a leading container management approach. 
Not completely coincidentally, Kubernetes is the Greek term 
for the English “Pilot”. 


⁵ The generality of the Pilot paradigm across workload, DCR, 
and resource types was first discussed in Ref. [179], wherein 
an initial conceptual model for Pilot systems was proposed. 
The introduction of the pilot architecture pattern and the 
discussion in §2 and §3 enhance and extend the preliminary 
analysis of Ref. [179]. 


5.2 Future Directions and Challenges 

The Pilot landscape is currently fragmented with duplicated 
effort and capabilities. The reasons for this balkanization 
can be traced back mainly to two factors: (i) the 
relatively recent discovery of the generality and relevance of 
the Pilot paradigm; and (ii) the development model fostered 
within academic institutions. 

As seen in the preceding sections, Pilot systems were developed as a 
pragmatic solution to improve the throughput of distributed 
applications, and were designed as local and point solutions. 
Pilot systems were not conceived from their inception as 
independent systems but, at best, as modules within a 
framework. Inheriting the development model of the 
scientific projects within which they were initially developed, 
Pilot systems were not engineered to promote (re)usability, 
modularity, open interfaces, or long-term sustainability. Col¬ 
lectively, this resulted in duplication of development effort 
across frameworks and projects, and hindered the apprecia¬ 
tion for the generality of the Pilot abstraction, the theoretical 
framework underlying the Pilot systems, and the paradigm 
for application execution they enable. 

Consistent with this analysis, many of the Pilot systems 
described in §4.3 offer a set of overlapping functionalities. This 
duplication may have to be reduced in the future to promote 
the maintainability, robustness, interoperability, extensibility, 
and overall capabilities of existing Pilot systems. As seen 
in §4.4, Pilot systems are already progressively supporting 
diverse DCRs and types of workload. This trend might lead 
to consolidation and to increased adoption of multi-purpose 
Pilot systems. The scope of the consolidation process will 
depend on the diversity of programming languages, deployment 
models, and interactions with existing applications, and on 
how this diversity will be addressed. 

The analysis proposed in this paper suggests critical com¬ 
monalities across Pilot systems stemming from a shared 
architectural pattern, abstraction, and computing paradigm. 
Models of pilot functionality can be grounded on these 
commonalities, which can also be reflected in the definition of 
unified and open interfaces for users, applications, and DCRs. 
End-users, developers, and DCR administrators could rely 
upon these interfaces, which would promote better integration 
of Pilot systems into application- and resource-facing 
middleware. 

There is evidence of ongoing integration and consolida¬ 
tion processes, such as the adoption of extensible workload 
management capabilities or utilization of similar resource 
interoperability layers. For example, PanDA is iterating its 
development cycle and the resulting system, called “Big PanDA”, 
is now capable of opportunistically submitting pilots to the 
Titan supercomputer [205] at the Oak Ridge Leadership 
Computing Facility (OLCF) [186, 206]. Further, Big PanDA has 
adopted SAGA, an open and standardized DCR interoperability 
library developed independently of Pilot systems and 
now used by both Big PanDA and RADICAL-Pilot. 


5.3 Summary and Contributions 

This paper contributes to the understanding, design, and 
adoption of Pilot systems by characterizing the Pilot 
abstraction, the Pilot paradigm, and exemplar implementations. 

We provided an analysis of the technical origins and 
motivations of Pilot systems in §2 and we summarized their 
chronological development in Figure 1. We described the 
logical components and functionalities that constitute the Pilot 
abstraction in §3 and we outlined them in Figure 2. We then 
defined a consistent terminology to clarify the heterogeneity 
of the Pilot systems landscape, and we used this terminology 
together with the logical components and functionalities of 
the Pilot systems to specify the pilot architecture pattern in 
Figure 3. 

We defined the core and auxiliary properties of Pilot system 
implementations in §4 (Tables 1 and 2). We then used these 
properties alongside the contributions offered in §3 to describe 
seven exemplar Pilot system implementations. We gave 
details about their architecture and execution model showing 
how they conformed to the pilot architecture paradigm we 
defined in §3.2. We summarized this analysis in Figures 4-10. 

We used the Pilot abstraction and insight about Pilot 
systems, their motivations and diverse implementations to 
highlight the properties of the Pilot paradigm in §5. We 
argued for the generality of the Pilot paradigm on the basis of 
demonstrated generality of the type of workload and use cases 
Pilot systems can execute, as well as a lack of constraints on 
the type of DCR that can be used or on the type of resource 
exposed by the pilots. Finally, we reviewed the benefits that 
a more structured approach to the conceptualization and 
design of Pilot systems may offer. 

With this paper, we also contributed a methodology to 
evaluate software systems that have developed organically 
and without an established theoretical framework. This 
methodology is composed of five steps: (i) analysis of the 
abstraction(s) underlying the observed software system 
implementations; (ii) the definition of a consistent terminology 
to reason about abstractions; (iii) the evaluation of the 
components and functionalities that may constitute a specific 
architectural pattern for the implementation of that 
abstraction; (iv) the definition of core and auxiliary implementation 
properties; (v) the evaluation of implementations. 

The application of this methodology offers the opportunity 
to uncover the theoretical framework underlying the observed 
software systems, and to understand whether such systems 
are implementations of a well-defined and independent 
abstraction. This theoretical framework can be used to inform 
or understand the development and engineering of software 
systems without mandating specific design, representation, 
or development methodologies or tools. 

Workflow systems are amenable to be studied with the 
methodology proposed and used in this paper. Multiple 
workflow systems have been developed independently to serve 
diverse use cases and be executed on heterogeneous DCRs. 
In spite of broad surveys [52, 207-209] about workflow 
systems and their usage scenarios, an encompassing theoretical 
framework for the underlying abstraction, or set of 
abstractions if any, is not yet available. This is evident in the state 
of workflow systems which shows a significant duplication 
of effort, limited extensibility and interoperability, and 
proprietary solutions for interfaces to both the resource and 
application layers. 
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